Observe some dataset $\boldsymbol{X} \in \mathcal{X}^{N}$. We want to do [[maximum likelihood estimation]]

$p^{\text{mle}} = \arg\max_{ p \in \mathcal{P} } \log p(\boldsymbol{X})$

where the [[statistical model]] $\mathcal{P} \subseteq \triangle(\mathcal{Z}, \mathcal{X})$ assumes a [[latent variable]] factorization and we apply [[law of total probability|lotp]] by abuse of notation.

# deriving the elbo

maybe $\mathcal{Z}$ is large / [[high-dimensional learning|high-dimensional]] and hard to integrate over. so use [[importance sampling]]: for any $q \in \triangle(\mathcal{Z})$,

$p(x) = \mathop{\mathbb{E}}_{ z \sim q } \frac{p(z, x)}{q(z)}$

but this [[point estimator]] can have infinite variance. instead, apply [[Jensens inequality]]:

$\log p(x) \ge \mathop{\mathbb{E}}_{ z \sim q } \log \frac{p(z, x)}{q(z)}$

the right-hand side is called the "evidence lower bound" $\text{elbo}[x](p, q)$. note the identity

$\mathop{\mathbb{E}}_{ z \sim q } \log \frac{p(z, x)}{q(z)} = \log p(x) - \text{kl}(q \parallel p[x])$ ^identity

where $p[x] \in \triangle(\mathcal{Z})$ denotes the [[posterior]]. or alternatively, in terms of the [[prior]] $\pi$,

$\mathop{\mathbb{E}}_{ z \sim q } \log \frac{p(z, x)}{q(z)} = \mathop{\mathbb{E}}_{ z \sim q } \log p[z](x) - \text{kl}(q \parallel \pi)$ ^identity-prior

where $p[z](x)$ denotes the likelihood of $x$ under latent $z$. writing $f$ for the joint model and $q_{n}$ for the variational distribution of datapoint $n$, maximizing the elbo is a [[bilevel optimization]] problem:

$p^{\text{elbo}} = \arg\max_{ f, q } \sum_{n} \text{elbo}[x_{n}](f, q_{n})$

# expectation maximization

1. "expectation": fix $f$ and maximize over $\boldsymbol{q}$ by computing the [[posterior]] distribution of the latent variables (to minimize [[Kullback-Leibler divergence|kld]])
    - simple when $f$ is a [[conjugate prior|conjugate pair]], otherwise need [[variational inference]]
2. "maximization": fix $\boldsymbol{q}$ and maximize over the [[parameter]]s of $f$ via [[gradient descent]]
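the two elbo identities (^identity and ^identity-prior) can be checked numerically in the tractable case where $\mathcal{Z}$ is a small finite set, so all expectations are exact sums. a minimal sketch; the model here (sizes $K$, $V$, dirichlet-sampled tables) is purely illustrative, not from the note:

```python
import numpy as np

rng = np.random.default_rng(0)

# hypothetical discrete model: latent z in {0..K-1}, observed x in {0..V-1}
K, V = 4, 6
prior = rng.dirichlet(np.ones(K))            # pi(z)
lik = rng.dirichlet(np.ones(V), size=K)      # p(x | z), one row per z

x = 2                                        # an observed datapoint
joint = prior * lik[:, x]                    # p(z, x) as a vector over z
evidence = joint.sum()                       # p(x) by lotp
posterior = joint / evidence                 # p[x](z)

q = rng.dirichlet(np.ones(K))                # an arbitrary variational q(z)

elbo = np.sum(q * np.log(joint / q))         # E_{z~q} log p(z,x)/q(z)
kl_post = np.sum(q * np.log(q / posterior))  # kl(q || p[x])
kl_prior = np.sum(q * np.log(q / prior))     # kl(q || pi)
exp_loglik = np.sum(q * np.log(lik[:, x]))   # E_{z~q} log p[z](x)

# ^identity: elbo = log p(x) - kl(q || posterior)
assert np.isclose(elbo, np.log(evidence) - kl_post)
# ^identity-prior: elbo = E_q log p[z](x) - kl(q || prior)
assert np.isclose(elbo, exp_loglik - kl_prior)
# jensen: elbo <= log evidence, tight iff q equals the posterior
assert elbo <= np.log(evidence)
```

the first identity also explains the expectation step below: with $f$ fixed, the elbo is maximized exactly when $\text{kl}(q \parallel p[x]) = 0$, i.e. $q$ equals the posterior.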
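the two em steps can be sketched on a 1-d gaussian mixture, a case where the e-step posterior is exact and the m-step even has a closed form (so no gradient descent is needed). a toy sketch with hypothetical synthetic data, not a general implementation:

```python
import numpy as np

rng = np.random.default_rng(1)

# toy data: two well-separated 1-d gaussian clusters (hypothetical)
x = np.concatenate([rng.normal(-2.0, 0.5, 200), rng.normal(3.0, 0.5, 200)])
N, K = len(x), 2

# parameters of f: mixture weights, means, variances
w = np.full(K, 1.0 / K)
mu = rng.normal(size=K)
var = np.ones(K)

for _ in range(50):
    # e-step: q_n = exact posterior over z_n (tractable conjugate-style case)
    logp = (np.log(w) - 0.5 * np.log(2 * np.pi * var)
            - 0.5 * (x[:, None] - mu) ** 2 / var)        # (N, K): log p(z, x_n)
    q = np.exp(logp - logp.max(axis=1, keepdims=True))   # stabilized softmax
    q /= q.sum(axis=1, keepdims=True)                    # responsibilities

    # m-step: maximize the expected complete-data log-likelihood in closed form
    nk = q.sum(axis=0)
    w = nk / N
    mu = q.T @ x / nk
    var = (q * (x[:, None] - mu) ** 2).sum(axis=0) / nk

print(sorted(mu.round(2)))  # means should approach the true cluster centers
```

each iteration increases the summed elbo: the e-step closes the kl gap to the posterior, the m-step then raises the bound over the parameters of $f$.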