suppose we've obtained a [[prediction rule]] $f : \mathcal{X} \to \mathcal{Y}$.
how well does it do on *unseen* inputs?
ie what's the [[mean|expected value]] of the [[loss function]] under the true [[statistics|data generating process]]:
$$
\text{Err}_{\text{g}}(f) = \mathop{\mathbb{E}}_{(\boldsymbol{x}, y) \sim p_{*}}[\mathrm{loss}[y](f(\boldsymbol{x}))].
$$
^test
- eg the test error using the [[zero-one loss]] is the [[indicator|misclassification probability]].
many ways to estimate this quantity. see [[model assessment]]
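a minimal sketch of the most basic estimator: draw a held-out test set from the same distribution and average the loss over it. the toy `sample_p_star` process, the fixed rule `f`, and the sample sizes below are all made up for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_p_star(n):
    """toy stand-in for the true data generating process p_*."""
    x = rng.normal(size=(n, 2))
    y = (x[:, 0] + x[:, 1] > 0).astype(int)  # true labeling rule
    return x, y

def f(x):
    """some fixed prediction rule (here: threshold on the first feature)."""
    return (x[:, 0] > 0).astype(int)

def zero_one_loss(y_true, y_pred):
    return (y_true != y_pred).astype(float)

# monte carlo estimate of Err_g(f): mean loss over a large held-out sample
x_test, y_test = sample_p_star(100_000)
err_hat = zero_one_loss(y_test, f(x_test)).mean()
print(f"estimated test error (misclassification probability): {err_hat:.3f}")
```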
suppose $f$ is learned from some training [[dataset]] $\mathcal{D} \sim p_{*}^{N}$ (eg by [[train loss|erm]]).
we say it *generalizes* if it performs well on *unseen* data (low test error). *generalization* is thought to be a core component of [[intelligence]]. compare with [[overfit]]ting / [[human memory|memorization]]
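a minimal sketch of that failure mode, assuming a toy 1-d regression process and an over-flexible polynomial fit (the degree, noise level, and sample sizes are arbitrary choices): the learned rule nails the training points but does poorly on fresh draws from $p_{*}$.

```python
import numpy as np

rng = np.random.default_rng(1)

def sample_p_star(n):
    """toy 1-d regression process: y = sin(x) + noise."""
    x = rng.uniform(-3, 3, size=n)
    y = np.sin(x) + rng.normal(scale=0.3, size=n)
    return x, y

# learn f from a small training set D ~ p_*^N
# (least squares here, ie erm with the squared loss)
x_train, y_train = sample_p_star(15)
coefs = np.polyfit(x_train, y_train, deg=12)  # over-flexible: degree 12 on 15 points
f = np.poly1d(coefs)

def squared_loss(y, y_hat):
    return (y - y_hat) ** 2

x_test, y_test = sample_p_star(10_000)
print("train error:", squared_loss(y_train, f(x_train)).mean())  # near zero
print("test  error:", squared_loss(y_test, f(x_test)).mean())    # typically much larger: overfitting
```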
not to be confused with [[risk]], which evaluates the [[optimization algorithm]] rather than the specific [[prediction rule]].
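a minimal sketch of one way to read that distinction, under the assumption that the quantity attached to the *algorithm* is the test error averaged over fresh training sets $\mathcal{D} \sim p_{*}^{N}$ (the toy process and the least-squares learner are made up for illustration, not taken from [[risk]]):

```python
import numpy as np

rng = np.random.default_rng(2)

def sample_p_star(n):
    """toy process: y = 2x + noise."""
    x = rng.normal(size=n)
    y = 2.0 * x + rng.normal(scale=1.0, size=n)
    return x, y

def learn(x, y):
    """the learning algorithm: least-squares slope through the origin."""
    slope = (x @ y) / (x @ x)
    return lambda x_new: slope * x_new

def test_error(f, n_test=50_000):
    """monte carlo estimate of Err_g(f) under the squared loss."""
    x, y = sample_p_star(n_test)
    return np.mean((y - f(x)) ** 2)

# test error of ONE specific prediction rule, learned from ONE dataset D
x_train, y_train = sample_p_star(20)
f_hat = learn(x_train, y_train)
print("Err_g of this particular f:", test_error(f_hat))

# quantity attached to the algorithm: average Err_g over many draws of D ~ p_*^N
errs = [test_error(learn(*sample_p_star(20))) for _ in range(200)]
print("average over training sets:", np.mean(errs))
```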
# sources
[Generalization error - Wikipedia](https://en.wikipedia.org/wiki/Generalization_error)
[[2013HastieEtAlElementsStatisticalLearning|ESL]]