Consider a [[discrete random variable]] $X$ with distribution $p \in \Delta(\mathcal{X})$. Then the [[mean|average]] [[surprise]] is called the **entropy**:
$$
\begin{align*}
H(p) &:= \mathop{\mathbb{E}}_{x \sim p}[- \log p(x)]\\
&= \sum_{x \in \mathcal{X}} p(x) (- \log p(x))
\end{align*}
$$
^entropy

This is easily extended to [[conditional entropy]] and then to the notion of [[mutual information]].

# empirical average ([[statistic]])

Note that given $N$ samples from $X$, we can estimate the entropy using the [[sample mean]] surprise $\hat{H}(x_{1:N}) = \frac{1}{N} \sum_{n \in [N]} - \log p(x_{n})$.

- Note that if the $N$ samples are instead drawn from $Y$ but we evaluate their surprise under the distribution of $X$, this gives the [[cross entropy]]. See [[probabilistic interpretation of loss function]].

# axiomatic definition

We can also characterize entropy using the following axioms:

- **uniform has max (marginal) entropy**: when $\mathcal{X}$ is finite, $H(X) \le \log |\mathcal{X}|$, with equality iff $X$ is uniform over $\mathcal{X}$.
- **chain rule**: $H(X, Y) = H(X) + H(Y \mid X)$, that is, we can convey the values of $X$ and $Y$ by first conveying $X$ and then conveying $Y$ given $X$.
- **conditioning lowers (or preserves) entropy**: $H(Y \mid X) \le H(Y)$.

(These are checked numerically in the sketch below.)

# sources

[[2013HastieEtAlElementsStatisticalLearning]] eq 7.6
[[STAT 110]] 10.1.7, 10.1.9
[[STAT 111]] chap 4.3
https://www.math3ma.com/blog/a-new-perspective-of-entropy
- [bradley\_spring22.pdf](https://math3ma.institute/wp-content/uploads/2022/02/bradley_spring22.pdf)
[[2015OlahVisualInformationTheory]]
[[COMPSCI 229br]]
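
# numerical sketch

A minimal numerical sketch of the definition above, in Python with NumPy. The helper `entropy` and the example distribution `p` are illustrative, not taken from the sources.

```python
import numpy as np

def entropy(p):
    """Shannon entropy H(p) = -sum_x p(x) log p(x), in nats.

    Terms with p(x) == 0 contribute 0, since t log t -> 0 as t -> 0.
    """
    p = np.asarray(p, dtype=float)
    nz = p[p > 0]
    return float(-np.sum(nz * np.log(nz)))

# example distribution on a 4-symbol alphabet
p = np.array([0.5, 0.25, 0.125, 0.125])
print(entropy(p))   # ≈ 1.2130 nats
print(np.log(4))    # log|X| ≈ 1.3863, the uniform upper bound
```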
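
Continuing the same sketch: the sample-mean surprise from the empirical-average section, and the cross-entropy variant where the samples come from a different distribution. Here `q` and the sample size are made-up choices for illustration.

```python
rng = np.random.default_rng(0)

# N samples x_1..x_N ~ p; the sample-mean surprise estimates H(p)
x_samples = rng.choice(len(p), size=10_000, p=p)
H_hat = float(np.mean(-np.log(p[x_samples])))

# if the samples are instead drawn from q but still scored by -log p(x),
# the same average estimates the cross entropy H(q, p)
q = np.full(4, 0.25)
y_samples = rng.choice(len(q), size=10_000, p=q)
cross_hat = float(np.mean(-np.log(p[y_samples])))
cross_true = float(-np.sum(q * np.log(p)))   # = H(q) + KL(q || p) >= H(q)

print(H_hat, entropy(p))      # both ≈ 1.21
print(cross_hat, cross_true)  # both ≈ 1.56 > H(q) = log 4 ≈ 1.39
```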
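
Finally, a check of the three properties from the axiomatic section on a small made-up joint distribution of $(X, Y)$, reusing the `entropy` helper above.

```python
# joint pmf of (X, Y) on a 2x3 grid; rows index X, columns index Y
p_xy = np.array([[0.10, 0.20, 0.10],
                 [0.25, 0.05, 0.30]])
p_x = p_xy.sum(axis=1)
p_y = p_xy.sum(axis=0)

H_xy = entropy(p_xy.ravel())
H_x, H_y = entropy(p_x), entropy(p_y)
# H(Y | X) = sum_x p(x) H(Y | X = x)
H_y_given_x = sum(p_x[i] * entropy(p_xy[i] / p_x[i]) for i in range(len(p_x)))

assert np.isclose(H_xy, H_x + H_y_given_x)    # chain rule
assert H_y_given_x <= H_y + 1e-12             # conditioning lowers (or preserves) entropy
assert H_xy <= np.log(p_xy.size) + 1e-12      # uniform bound over the joint alphabet
```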