suppose we've obtained a [[prediction rule]] $f : \mathcal{X} \to \mathcal{Y}$. how well does it do on *unseen* inputs? ie what's the [[mean|expected value]] of the [[loss function]] over the true [[statistics|data generating process]]: $\text{Err}_{\text{g}}(f) = \mathop{\mathbb{E}}_{(\boldsymbol{x}, y) \sim p_{*}}[\mathrm{loss}[y](f(\boldsymbol{x}))].$ ^test

- eg the test error under the [[zero-one loss]] is the [[indicator|misclassification probability]].

there are many ways to estimate this quantity (eg averaging the loss over a held-out test set, sketched below); see [[model assessment]]

suppose $f$ is learned from some training [[dataset]] $\mathcal{D} \sim p_{*}^{N}$ (eg by [[train loss|erm]]). we say it *generalizes* if it performs well on *unseen* data, ie has low test error. *generalization* is thought to be a core component of [[intelligence]]. compare with [[overfit]]ting / [[human memory|memorization]]

Not to be confused with [[risk]], which evaluates the [[optimization algorithm]] rather than a specific [[prediction rule]]

# sources

[Generalization error - Wikipedia](https://en.wikipedia.org/wiki/Generalization_error)

[[2013HastieEtAlElementsStatisticalLearning|ESL]]
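
a minimal sketch of the held-out estimate mentioned above, under the zero-one loss. names here are hypothetical: `f` is any fitted prediction rule, and `X_test`, `y_test` are assumed to be fresh samples from $p_{*}$ not seen during training.

```python
import numpy as np

def estimate_test_error(f, X_test, y_test):
    """Monte Carlo estimate of Err_g(f) under the zero-one loss:
    the fraction of held-out examples that f misclassifies."""
    y_hat = np.array([f(x) for x in X_test])             # predictions on unseen inputs
    return float(np.mean(y_hat != np.asarray(y_test)))   # average 0-1 loss

# hypothetical usage: f is a fitted prediction rule, (X_test, y_test) held out from training
# err_hat = estimate_test_error(f, X_test, y_test)
```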