Consider a [[discrete random variable]] $X$ with distribution $p \in \Delta(\mathcal{X})$. Then the [[mean|average]] [[surprise]] is called the **entropy**:
$$
\begin{align*}
H(p) &:= \mathop{\mathbb{E}}_{x \sim p}[ - \log p(x)]\\
&= \sum_{x \in \mathcal{X}} p(x) (- \log p(x))
\end{align*}
$$
^entropy
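A minimal numerical sketch of this definition (assuming a small finite $\mathcal{X}$ and natural-log units, i.e. nats; the helper name `entropy` is my own):
```python
import numpy as np

def entropy(p) -> float:
    """Average surprise -sum_x p(x) log p(x), in nats, with 0 log 0 := 0."""
    p = np.asarray(p, dtype=float)
    nz = p > 0                      # drop zero-probability outcomes
    return float(-np.sum(p[nz] * np.log(p[nz])))

print(entropy([0.5, 0.5]))          # log 2 ≈ 0.693: a fair coin
print(entropy([1.0, 0.0]))          # 0.0: a deterministic outcome carries no surprise
print(entropy([0.25] * 4))          # log 4 ≈ 1.386: uniform over 4 outcomes
```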
This is easily extended to [[conditional entropy]] and then to the notion of [[mutual information]].
# empirical average ([[statistic]])
Note that given $N$ i.i.d. samples $x_{1:N}$ from $X$, we can estimate the entropy using the [[sample mean]] surprise
$\hat{H}(x_{1:N}) = \frac{1}{N} \sum_{n \in [N]} - \log p(x_{n}).$
- Note that if the $N$ samples are drawn from a different variable $Y$ but we still evaluate their surprise under $X$'s distribution, this sample mean instead estimates the [[cross entropy]].
See [[probabilistic interpretation of loss function]]
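A sketch of the sample-mean surprise estimator and its cross-entropy variant (an illustration, not a definitive implementation; the helper name `sample_mean_surprise` and the particular distributions `p`, `q` are hypothetical):
```python
import numpy as np

rng = np.random.default_rng(0)

def sample_mean_surprise(samples: np.ndarray, p: np.ndarray) -> float:
    """\\hat{H}(x_{1:N}) = (1/N) sum_n -log p(x_n)."""
    return float(np.mean(-np.log(p[samples])))

p = np.array([0.7, 0.2, 0.1])       # X's distribution
q = np.array([0.1, 0.2, 0.7])       # Y's (different) distribution

xs = rng.choice(len(p), size=100_000, p=p)
ys = rng.choice(len(q), size=100_000, p=q)

print(sample_mean_surprise(xs, p))  # ≈ H(p): samples and surprise from the same law
print(sample_mean_surprise(ys, p))  # ≈ cross entropy: samples from q, scored under p
```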
# axiomatic definition
We can also characterize entropy using the following axioms; a numerical sanity check follows the list.
- **uniform has max (marginal) entropy**: when $\mathcal{X}$ is finite, $H(X) \le \log |\mathcal{X}|$, with equality iff $X$ is uniform over $\mathcal{X}$.
- **chain rule**: $H(X, Y) = H(X) + H(Y \mid X)$, that is, we can convey the values of $X$ and $Y$ by first conveying $X$ and then conveying $Y$ given $X$.
- **conditioning lowers (or preserves) entropy**: $H(Y \mid X) \le H(Y)$.
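A small numerical check of these properties on a hypothetical joint pmf `pxy` (a sketch, not a proof; the helper `H` and the specific numbers are my own):
```python
import numpy as np

def H(p) -> float:
    """Entropy of a (flattened) pmf, in nats, with 0 log 0 := 0."""
    p = np.asarray(p, dtype=float).ravel()
    nz = p > 0
    return float(-np.sum(p[nz] * np.log(p[nz])))

# Hypothetical joint pmf p(x, y) over a 2x3 alphabet.
pxy = np.array([[0.30, 0.15, 0.10],
                [0.05, 0.20, 0.20]])

px = pxy.sum(axis=1)                                # marginal of X
py = pxy.sum(axis=0)                                # marginal of Y
# H(Y|X) = sum_x p(x) H(Y | X = x), computed from the conditional rows.
H_Y_given_X = sum(px[i] * H(pxy[i] / px[i]) for i in range(len(px)))

print(H(px) <= np.log(len(px)))                     # uniform has max marginal entropy
print(np.isclose(H(pxy), H(px) + H_Y_given_X))      # chain rule: H(X,Y) = H(X) + H(Y|X)
print(H_Y_given_X <= H(py))                         # conditioning lowers (or preserves) entropy
```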
# sources
[[2013HastieEtAlElementsStatisticalLearning]] eq 7.6
[[STAT 110]] 10.1.7, 10.1.9
[[STAT 111]] chap 4.3
https://www.math3ma.com/blog/a-new-perspective-of-entropy
- [bradley_spring22.pdf](https://math3ma.institute/wp-content/uploads/2022/02/bradley_spring22.pdf)
[[2015OlahVisualInformationTheory]]
[[COMPSCI 229br]]