Observe some dataset $\boldsymbol{X} \in \mathcal{X}^{N}$.
Want to do [[maximum likelihood estimation]]
$
p^{\text{mle}} = \arg\max_{ p \in \mathcal{P} } \log p(\boldsymbol{X})
$
where the [[statistical model]] $\mathcal{P} \subseteq \triangle(\mathcal{Z}, \mathcal{X})$ assumes a [[latent variable]] factorization
and we apply [[law of total probability|lotp]] by abuse of notation.
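spelling out the lotp step (assuming the $x_{n}$ are iid under $p$, so the marginal factorizes over data points — this is what [[#deriving the elbo|the sum over $n$ below]] relies on):
$
\log p(\boldsymbol{X}) = \sum_{n} \log p(x_{n}) = \sum_{n} \log \int_{\mathcal{Z}} p(z, x_{n}) \, dz
$
where $p$ denotes both the joint and its marginal (the abuse of notation).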
# deriving the elbo
maybe $\mathcal{Z}$ is large / [[high-dimensional learning|high-dimensional]] and hard to integrate over.
so [[importance sampling]]: for any $q \in \triangle(\mathcal{Z})$ with full support:
$
p(x) = \mathop{\mathbb{E}}_{ z \sim q } \frac{p(z, x)}{q(z)}
$
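a minimal numeric sketch of this identity, assuming a toy model $z \sim \mathcal{N}(0,1)$, $x \mid z \sim \mathcal{N}(z, 1)$ (so the exact marginal is $p(x) = \mathcal{N}(x; 0, 2)$) and the prior as proposal $q$:

```python
import numpy as np

rng = np.random.default_rng(0)

def normal_pdf(x, mean, var):
    return np.exp(-(x - mean) ** 2 / (2 * var)) / np.sqrt(2 * np.pi * var)

# toy latent-variable model: z ~ N(0, 1), x | z ~ N(z, 1),
# so the exact marginal is p(x) = N(x; 0, 2)
def joint(z, x):
    return normal_pdf(z, 0.0, 1.0) * normal_pdf(x, z, 1.0)

x = 1.5
z = rng.standard_normal(100_000)              # samples from proposal q = N(0, 1)
weights = joint(z, x) / normal_pdf(z, 0.0, 1.0)
estimate = weights.mean()                     # E_{z~q}[ p(z, x) / q(z) ]
exact = normal_pdf(x, 0.0, 2.0)
```

here $q$ equals the prior, so the weights reduce to $p(x \mid z)$; a poorly matched $q$ would blow up the weight variance.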
but this [[point estimator]] can have infinite variance. instead, lower-bound $\log p(x)$ via [[Jensens inequality|Jensen's inequality]]:
$
\log p(x) \ge \mathop{\mathbb{E}}_{ z \sim q } \log \frac{p(z, x)}{q(z)}
$
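a numeric sketch of the bound on the same kind of toy model ($z \sim \mathcal{N}(0,1)$, $x \mid z \sim \mathcal{N}(z,1)$): an arbitrary $q$ gives a strict lower bound, while $q$ equal to the exact posterior $\mathcal{N}(x/2, 1/2)$ makes the bound tight:

```python
import numpy as np

rng = np.random.default_rng(0)

def log_normal_pdf(x, mean, var):
    return -0.5 * ((x - mean) ** 2 / var + np.log(2 * np.pi * var))

# toy model: z ~ N(0, 1), x | z ~ N(z, 1), so log p(x) = log N(x; 0, 2)
def log_joint(z, x):
    return log_normal_pdf(z, 0.0, 1.0) + log_normal_pdf(x, z, 1.0)

def elbo(x, q_mean, q_var, n=100_000):
    z = q_mean + np.sqrt(q_var) * rng.standard_normal(n)   # z ~ q
    return np.mean(log_joint(z, x) - log_normal_pdf(z, q_mean, q_var))

x = 1.5
log_px = log_normal_pdf(x, 0.0, 2.0)
loose = elbo(x, 0.0, 1.0)          # arbitrary q: strict lower bound
tight = elbo(x, x / 2, 0.5)        # q = exact posterior N(x/2, 1/2): tight
```

with the exact posterior the integrand $\log \frac{p(z,x)}{q(z)}$ equals $\log p(x)$ pointwise, so the estimator has zero variance.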
this is called the "evidence lower bound" $\text{elbo}[x](p, q)$.
note the identity
$
\mathop{\mathbb{E}}_{ z \sim q } \log \frac{p(z, x)}{q(z)} = \log p(x) - \text{kl}(q \parallel p[x])
$
^identity
where $p[x] \in \triangle(\mathcal{Z})$ denotes the [[posterior]].
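spelling this out: factor the joint through the posterior, $p(z, x) = p[x](z) \, p(x)$, so
$
\mathop{\mathbb{E}}_{ z \sim q } \log \frac{p(z, x)}{q(z)} = \mathop{\mathbb{E}}_{ z \sim q } \left[ \log p(x) - \log \frac{q(z)}{p[x](z)} \right] = \log p(x) - \text{kl}(q \parallel p[x])
$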
or alternatively in terms of the [[prior]] $\pi \in \triangle(\mathcal{Z})$ and likelihood $p[z] \in \triangle(\mathcal{X})$
$\mathop{\mathbb{E}}_{ z \sim q } \log \frac{p(z, x)}{q(z)} = \mathop{\mathbb{E}}_{ z \sim q } \log p[z](x) - \text{kl}(q \parallel \pi)$
^identity-prior
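this follows by factoring the joint the other way, $p(z, x) = \pi(z) \, p[z](x)$:
$
\mathop{\mathbb{E}}_{ z \sim q } \log \frac{p(z, x)}{q(z)} = \mathop{\mathbb{E}}_{ z \sim q } \log p[z](x) - \mathop{\mathbb{E}}_{ z \sim q } \log \frac{q(z)}{\pi(z)}
$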
maximizing it jointly over the model $f \in \mathcal{P}$ and per-datapoint variational distributions $\boldsymbol{q}$ is a [[bilevel optimization]] problem:
$
p^{\text{elbo}} = \arg\max_{ f, q } \sum_{n} \text{elbo}[x_{n}](f, q_{n})
$
# expectation maximization
1. "expectation": fix $f$ and maximize over $\boldsymbol{q}$ by computing the [[posterior]] distribution of each latent variable (minimizing $\text{kl}(q_{n} \parallel f[x_{n}])$, per the identity above)
	- simple when the prior and likelihood form a [[conjugate prior|conjugate pair]]; otherwise need [[variational inference]]
2. "maximization": fix $\boldsymbol{q}$ and maximize over the [[parameter]]s of $f$ via [[gradient descent]]
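a minimal em sketch, assuming a two-component gaussian mixture as the model (here $z$ is discrete, so the e-step posterior is closed-form, and the gaussian m-step is also closed-form — no gradient descent needed):

```python
import numpy as np

rng = np.random.default_rng(0)

def normal_pdf(x, mean, var):
    return np.exp(-(x - mean) ** 2 / (2 * var)) / np.sqrt(2 * np.pi * var)

# synthetic data: 300 points from N(-2, 1) and 700 from N(3, 1)
x = np.concatenate([rng.normal(-2.0, 1.0, 300), rng.normal(3.0, 1.0, 700)])

w = 0.5                                  # mixing weight of component 1
mu = np.array([-1.0, 1.0])               # component means
var = np.array([1.0, 1.0])               # component variances
for _ in range(50):
    # e-step: q_n = exact posterior over z_n (closed form since z is discrete)
    lik0 = (1 - w) * normal_pdf(x, mu[0], var[0])
    lik1 = w * normal_pdf(x, mu[1], var[1])
    r = lik1 / (lik0 + lik1)             # responsibility of component 1
    resp = np.stack([1 - r, r], axis=1)  # (n, 2) posterior probabilities
    # m-step: maximize the elbo over the parameters (closed form for gaussians)
    nk = resp.sum(axis=0)
    w = nk[1] / len(x)
    mu = (resp * x[:, None]).sum(axis=0) / nk
    var = (resp * (x[:, None] - mu) ** 2).sum(axis=0) / nk
```

after convergence the means land near $-2$ and $3$ and the weight near $0.7$; each iteration does the e-step then the m-step exactly as in the alternation above.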