Preamble
Walt Disney Hall (image)
Unfortunately, it is still taught in machine learning classes that overfitting can be detected by comparing the training and test learning curves of a single model. The origin of this misconception is unknown; it appears that an urban legend has diffused into mainstream practice, and even academic works take the misconception for granted. Overfitting is, by definition, inherently about comparing the complexities of two (or more) models. Models manifest themselves as the inductive biases that a modeller or data scientist brings to a task. This makes overfitting, at its core, a Bayesian concept. It is not about inspecting training and test learning curves to see whether a single model is following noise, but about a pairwise model comparison and testing procedure that selects the more plausible belief among our beliefs, namely the one carrying the least information: entities should not be multiplied beyond necessity, i.e., Occam's razor. To make this practical, we introduce a new concept, goodness of rank, to distinguish it from the well-known goodness of fit, clarify the underlying concepts, and provide steps for attributing overfitting or underfitting to models.
Poorly generalised model: Overgeneralisation or under-generalisation
The practice described in machine learning classes, and followed in industry, is that overfitting means your model follows the training set closely but fails to generalise to the test set. That is not an overfitted model but a model that fails to generalise: a phenomenon that should be called overgeneralisation (or under-generalisation). A minimal sketch of measuring this generalisation gap for a single model follows below.
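The following is a minimal sketch, assuming scikit-learn is available; the synthetic sine dataset, noise level, and polynomial degree are illustrative choices, not part of the original post. It measures the gap between training and test error of a single model, which is a statement about overgeneralisation, not about overfitting.

```python
# Sketch: the generalisation gap of a *single* model (overgeneralisation).
# Assumptions: scikit-learn; the dataset and degree are illustrative only.
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(42)
X = rng.uniform(-3, 3, size=(200, 1))
y = np.sin(X).ravel() + rng.normal(scale=0.3, size=200)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.5, random_state=0
)

# A single, flexible inductive bias fitted to the training set.
model = make_pipeline(PolynomialFeatures(degree=15), LinearRegression())
model.fit(X_train, y_train)

train_err = mean_squared_error(y_train, model.predict(X_train))
test_err = mean_squared_error(y_test, model.predict(X_test))

# A large gap signals overgeneralisation (failure to generalise);
# it says nothing about overfitting, which needs a second model to rank against.
print(f"train MSE={train_err:.3f}  test MSE={test_err:.3f}  gap={test_err - train_err:.3f}")
```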
A procedure to detect an overfitted model: Goodness of rank
We have previously provided a complexity-based, abstract description of the model selection procedure as complexity ranking; here we repeat that procedure while identifying the overfitted model explicitly (a code sketch follows the list below).
- Define a complexity measure $\mathscr{C}(\mathscr{M})$ over an inductive bias.
- Define a generalisation measure $\mathscr{G}(\mathscr{M}, \mathscr{D})$ over an inductive bias and a dataset.
- Select a set of inductive biases, at least two: $\mathscr{M}_{1}$ and $\mathscr{M}_{2}$.
- Produce complexity and generalisation measures on $(\mathscr{M}, \mathscr{D})$; here, for the two inductive biases: $\mathscr{C}_{1}$, $\mathscr{C}_{2}$, $\mathscr{G}_{1}$, $\mathscr{G}_{2}$.
- Rank $\mathscr{M}_{1}$ and $\mathscr{M}_{2}$ via $\arg\max \{ \mathscr{G}_{1}, \mathscr{G}_{2} \}$ and $\arg\min \{ \mathscr{C}_{1}, \mathscr{C}_{2} \}$.
- $\mathscr{M}_{1}$ is an overfitted model compared to $\mathscr{M}_{2}$ if $\mathscr{G}_{1} \leq \mathscr{G}_{2}$ and $\mathscr{C}_{1} > \mathscr{C}_{2}$.
- $\mathscr{M}_{2}$ is an overfitted model compared to $\mathscr{M}_{1}$ if $\mathscr{G}_{2} \leq \mathscr{G}_{1}$ and $\mathscr{C}_{2} > \mathscr{C}_{1}$.
- $\mathscr{M}_{1}$ is an underfitted model compared to $\mathscr{M}_{2}$ if $\mathscr{G}_{1} < \mathscr{G}_{2}$ and $\mathscr{C}_{1} < \mathscr{C}_{2}$.
- $\mathscr{M}_{2}$ is an underfitted model compared to $\mathscr{M}_{1}$ if $\mathscr{G}_{2} < \mathscr{G}_{1}$ and $\mathscr{C}_{2} < \mathscr{C}_{1}$.
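The sketch below instantiates the goodness-of-rank procedure under illustrative assumptions: the complexity measure $\mathscr{C}(\mathscr{M})$ is taken as the number of polynomial coefficients of the inductive bias, and the generalisation measure $\mathscr{G}(\mathscr{M}, \mathscr{D})$ as the mean cross-validated R² score. Both choices, and the synthetic dataset, are placeholders; any pair of measures fitting the definitions above would do.

```python
# Sketch: pairwise goodness-of-rank comparison of two inductive biases.
# Assumptions: scikit-learn; C(M) = number of polynomial coefficients,
# G(M, D) = mean 5-fold cross-validated R^2 (illustrative choices only).
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(200, 1))
y = np.sin(X).ravel() + rng.normal(scale=0.3, size=200)

def complexity(degree: int) -> int:
    """C(M): number of coefficients of the polynomial inductive bias."""
    return degree + 1

def generalisation(degree: int, X, y) -> float:
    """G(M, D): mean cross-validated R^2 of the inductive bias on the dataset."""
    model = make_pipeline(PolynomialFeatures(degree=degree), LinearRegression())
    return cross_val_score(model, X, y, cv=5).mean()

def goodness_of_rank(d1: int, d2: int, X, y) -> str:
    """Pairwise attribution for two inductive biases M1, M2 per the rules above."""
    c1, c2 = complexity(d1), complexity(d2)
    g1, g2 = generalisation(d1, X, y), generalisation(d2, X, y)
    if g1 <= g2 and c1 > c2:
        return "M1 is overfitted compared to M2"
    if g2 <= g1 and c2 > c1:
        return "M2 is overfitted compared to M1"
    if g1 < g2 and c1 < c2:
        return "M1 is underfitted compared to M2"
    if g2 < g1 and c2 < c1:
        return "M2 is underfitted compared to M1"
    return "No overfitting/underfitting attribution between M1 and M2"

# Example ranking: a degree-15 bias (M1) versus a degree-3 bias (M2).
print(goodness_of_rank(15, 3, X, y))
```

Note that the verdict is always relative: the procedure attributes overfitting or underfitting to one model with respect to another, never to a model in isolation.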
Some of the posts, in reverse chronological order, in which this blog has tried to convey what overfitting entails and its general implications:
- Empirical risk minimization is not learning: A mathematical definition of learning and re-understanding of overfitting and Occam's razor in machine learning
- Critical look on why deployed machine learning model performance degrade quickly.
- Bringing back Occam's razor to modern connectionist machine learning.
- Understanding overfitting: an inaccurate meme in Machine Learning
The holy grail of machine learning practice is the hold-out method: we want to make sure that we don't overgeneralise. However, a misconception has been propagated whereby overgeneralisation is treated as synonymous with overfitting. Overfitting has a different connotation: it is about ranking different models rather than measuring the generalisation ability of a single model. The generalisation gap between training and test sets is not about Occam's razor.