an [[1991KramerNonlinearPrincipalComponent|autoencoder]] is a pair of deterministic mappings $\text{enc}:x \mapsto z$ and $\text{dec}:z \mapsto \hat{x}$.
not a [[generative model]];
the [[pullback and pushforward|pushforward]] of the data distribution $p_{*}$ under $\text{enc}$ could be arbitrary, so there is no principled choice of distribution from which to sample $z$ and decode.

*Image from [Machine Learning at Berkeley Blog](https://ml.berkeley.edu/blog/posts/vq-vae/)*
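a minimal sketch of this enc/dec pair in PyTorch (the 784-dimensional input, 32-dimensional latent, and layer sizes are placeholder assumptions, not anything fixed above):

```python
import torch
import torch.nn as nn

# deterministic mappings: enc : x -> z and dec : z -> x_hat
enc = nn.Sequential(nn.Linear(784, 256), nn.ReLU(), nn.Linear(256, 32))
dec = nn.Sequential(nn.Linear(32, 256), nn.ReLU(), nn.Linear(256, 784))

x = torch.rand(64, 784)   # a batch of fake data points
z = enc(x)                # latent codes
x_hat = dec(z)            # reconstructions

# trained purely to reconstruct, e.g. with squared error
recon_loss = ((x - x_hat) ** 2).sum(dim=1).mean()
```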
in a [[variational autoencoder]]:
now $\text{enc} : \mathcal{X} \to \triangle(\mathcal{Z})$ and $\text{dec} : \mathcal{Z} \to \triangle(\mathcal{X})$,
and we add a [[regularization term]] so that $\text{enc}[x]$ becomes similar to some fixed prior $\pi \in \triangle(\mathcal{Z})$.
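a sketch of these new signatures using `torch.distributions` (the diagonal-Gaussian encoder, Bernoulli decoder, and all sizes are assumptions for illustration):

```python
import torch
import torch.nn as nn
from torch.distributions import Normal, Bernoulli

enc_net = nn.Linear(784, 2 * 32)   # predicts mean and log-variance of z
dec_net = nn.Linear(32, 784)       # predicts Bernoulli logits over x

def enc(x):
    """enc : X -> a distribution over Z (here a diagonal Gaussian)."""
    mu, log_var = enc_net(x).chunk(2, dim=-1)
    return Normal(mu, torch.exp(0.5 * log_var))

def dec(z):
    """dec : Z -> a distribution over X (here independent Bernoullis)."""
    return Bernoulli(logits=dec_net(z))

# the fixed prior pi over Z; the regularization term pushes enc(x) toward it
prior = Normal(torch.zeros(32), torch.ones(32))
```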
now we do [[variational inference]]:
$\text{enc}[x] \rightsquigarrow p_{*}[x]$, i.e. fit $\text{enc}[x]$ to the true posterior over $z$ given $x$.
we want $\text{enc}[x]$ to put mass only where $p_{*}[x]$ does (mode-seeking rather than mass-covering),
so we optimize the [[forward vs reverse kld|reverse kl]]
$\text{kl}(\text{enc}[x] \parallel p_{*}[x])$,
which is also the tractable direction: it only requires expectations under $\text{enc}[x]$.
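one way to see why the elbo shows up next: for any $q \in \triangle(\mathcal{Z})$ (a standard identity, writing $p(\cdot \mid x)$ for the posterior $p_{*}[x]$),
$$
\log p(x) \;=\; \underbrace{\mathbb{E}_{z \sim q}\!\left[\log \frac{p(z, x)}{q(z)}\right]}_{\text{elbo}} \;+\; \text{kl}\big(q \parallel p(\cdot \mid x)\big),
$$
and $\log p(x)$ does not depend on $q$, so minimizing the reverse kl over $q = \text{enc}[x]$ is the same as maximizing the elbo.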
recall the [[evidence lower bound|elbo]]
![[evidence lower bound#^identity-prior]]
substituting $q =\text{enc}[x]$ and $p(z, x) = \pi(z)\text{dec}[z](x)$:
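spelling that out (assuming the embedded identity above is the usual prior form of the elbo):
$$
\text{elbo}(x) \;=\; \mathbb{E}_{z \sim \text{enc}[x]}\big[\log \text{dec}[z](x)\big] \;-\; \text{kl}\big(\text{enc}[x] \parallel \pi\big)
$$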
first term maximizes reconstruction likelihood (typically estimated with a single sample from $\text{enc}[x]$);
second term pulls $\text{enc}[x]$ toward the prior $\pi$.
sometimes we assume $\text{range}\,\text{enc}$ lies in the subset of $\triangle(\mathcal{Z})$ consisting of product distributions,
aka [[mean field]]
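putting the pieces together, a sketch of the resulting loss for a mean-field Gaussian $\text{enc}[x]$ with a standard normal prior and a Bernoulli decoder (all of these modeling choices and sizes are assumptions for illustration):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

enc_net = nn.Linear(784, 2 * 32)   # mean and log-variance of the mean-field Gaussian enc[x]
dec_net = nn.Linear(32, 784)       # Bernoulli logits for dec[z]

def neg_elbo(x):
    mu, log_var = enc_net(x).chunk(2, dim=-1)

    # single-sample estimate of -E[log dec[z](x)] via the reparameterization trick
    z = mu + torch.exp(0.5 * log_var) * torch.randn_like(mu)
    recon = F.binary_cross_entropy_with_logits(dec_net(z), x, reduction="none").sum(dim=-1)

    # kl(enc[x] || N(0, I)) in closed form for a diagonal Gaussian
    kl = 0.5 * (mu ** 2 + log_var.exp() - log_var - 1).sum(dim=-1)

    return (recon + kl).mean()     # minimizing this maximizes the elbo

x = torch.rand(64, 784)            # fake batch with values in [0, 1]
neg_elbo(x).backward()
```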
https://colab.research.google.com/drive/1v0UiRwUiBi4IoZKXXnwZwHpVDsUIoeg0?usp=sharing
# sources
Wikipedia pages for [Evidence lower bound](https://en.wikipedia.org/wiki/Evidence_lower_bound), [Variational Bayesian methods](https://en.wikipedia.org/wiki/Variational_Bayesian_methods), [Variational autoencoders](https://en.wikipedia.org/wiki/Variational_autoencoder)
[Eric Jang: A Beginner's Guide to Variational Methods: Mean-Field Approximation](https://blog.evjang.com/2016/08/variational-bayes.html)
- Helpful visualization of forward vs reverse KL divergence.
[From Autoencoder to Beta-VAE | Lil'Log](https://lilianweng.github.io/posts/2018-08-12-vae/)
- Goes into more depth on related architectures (denoising, sparse, and contractive autoencoders) and later research including $\beta$-VAEs, vector quantized VAEs, and temporal difference VAEs.
[Princeton lecture notes from Professor David Blei](https://www.cs.princeton.edu/courses/archive/fall11/cos597C/lectures/variational-inference-i.pdf) ^princeton-notes
- Very in-depth and focuses on the optimization algorithms, which I've waved away in this post under the umbrella of "gradient ascent".
- Walks through a concrete example of a simple distribution whose posterior is hard to calculate: a mixture of Gaussians where the centroids are drawn from a Gaussian.
- Describes an improvement when $\mathcal{F}_{Z}$ is such that the distribution of each element, conditional on the others and on $x$, belongs to an exponential family.
[Tutorial - What is a variational autoencoder? – Jaan Altosaar](https://jaan.io/what-is-variational-autoencoder-vae-tutorial/)
- Great tutorial that draws the distinction between the deep learning perspective and the probabilistic model perspective
- [GitHub - altosaar/variational-autoencoder: Variational autoencoder implemented in tensorflow and pytorch (including inverse autoregressive flow)](https://github.com/altosaar/variational-autoencoder)
[Melody Mixer by Torin Blankensmith, Kyle Phillips - Experiments with Google](https://experiments.withgoogle.com/melody-mixer)
[Speech Interaction Technology at Aalto University / NSVQ · GitLab](https://gitlab.com/speech-interaction-technology-aalto-university/nsvq)
[Improving Variational Inference with Inverse Autoregressive Flow](https://arxiv.org/pdf/1606.04934.pdf)