Scientific Memo

Preamble

Walt Disney Hall,
Los Angeles (Wikipedia)

Unfortunately, it is still thought in machine learning classes that overfitting can be detected by comparing training and test learning curves on the single model's performance. The origins of this misconception is unknown. Looks like an urban legend has been diffused into main practice and even in academic works the misconception taken granted. Overfitting's definition appeared to be inherently about comparing complexities of two (or more) models. Models manifest themself as inductive biases modeller or data scientist inform in their tasks. This makes overfitting in reality a Bayesian concept at its core. It is not about comparing training and test learning curves that if model is following a noise, but pairwise model comparison-testing procedure to select more plausable belief among our beliefs that has the least information: entities should not be multiplied beyond necessity i.e., Occam's razor. We introduce a new concept in clarifying this practically, goodness of rank to distinguish from well known goodness of fit, and clarify concepts and provide steps to attribute models with overfitted or under-fitted models.

Poorly generalised model : Overgeneralisation or under-generalisation

The practice that is described in machine learning classes, and practiced in industry that overfitting is about your model following training set closely but fails to generalised in test set. This is not overfitted model but a model that fails to generalise: a phenomena should be called Overgeneralisation (or under-generalisation).

A procedure to detect overfitted model : Goodness of rank

We have provided complexity based abstract description of model selection procedure, here as complexity ranking: we will repeat this procedure with identification of the overfilled model explicitly.

In the following steps a sketch of an algorithmic recipe for complexity ranking of inductive biases via informal steps, overfitted model identification explicitly:

Define a complexity measure $\mathscr{C}$($\mathscr{M}$) over an inductive bias.
Define a generalisation measure $\mathscr{G}$($\mathscr{M}$, $\mathscr{D}$) over and inductive bias and dataset.
Select a set of inductive biases, at least-two, $\mathscr{M}_{1}$ and $\mathscr{M}_{2}$.
Produce complexity and generalisation measures on ($\mathscr{M}$, $\mathscr{D}$): Here for two inductive biases: $\mathscr{C}_{1}$, $\mathscr{C}_{2}$, $\mathscr{G}_{1}$, $\mathscr{G}_{2}$.
Ranking of $\mathscr{M}_{1}$ and $\mathscr{M}_{2}$: $argmax \{ \mathscr{G}_{1}, \mathscr{G}_{2}\}$ and $argmin \{ \mathscr{C}_{1}, \mathscr{C}_{2}\}$
$\mathscr{M}_{1}$ is an overfitted model compare to $\mathscr{M}_{2}$ if $\mathscr{G}_{1} <= \mathscr{G}_{2}$ and $\mathscr{C}_{1} \gt \mathscr{C}_{2}$.
$\mathscr{M}_{2}$ is an overfitted model compare to $\mathscr{M}_{1}$ if $\mathscr{G}_{2} <= \mathscr{G}_{1}$ and $\mathscr{C}_{2} \gt \mathscr{C}_{1}$.
$\mathscr{M}_{1}$ is an underfitted model compare to $\mathscr{M}_{2}$ if $\mathscr{G}_{1} < \mathscr{G}_{2}$ and $\mathscr{C}_{1} < \mathscr{C}_{2}$.
$\mathscr{M}_{2}$ is an underfitted model compare to $\mathscr{M}_{1}$ if $\mathscr{G}_{2} < \mathscr{G}_{1}$ and $\mathscr{C}_{2} < \mathscr{C}_{1}$.

If two model has the same complexity, then much better generalised model should be selected, in this case we can't conclude that either model is overfitted but generalised differently. Remembering that overfitting is about complexity ranking : Goodness of rank.

But overgeneralisation sounded like overfitting, isn't it?

Operationally overgeneralisation and overfitting implies two different things. Overgeneralisation operationally can be detected with a single model. Because, we can measure the generalisation performance of the model alone with data, in statistical literature this is called goodness of fit. Moreover overgeneralisation can also be called under-generalisation, as both implies poor generalisation performance.

However, overfitting implies a model overly performed compare to an other model i.e., model overfits but compare to what? Practically speaking, overgeneralisation can be detected via holdout method, but not overfitting. Overfitting goes beyond goodness of fit to goodness of rank as we provided recipe as pairwise model comparison.

Conclusion

The practice of comparing training and test learning curves for overfitting diffused into machine learning so deeply, the concept is almost always thought a bit in a fuzzy-way, even in distinguished lectures explicitly. Older textbooks and papers correctly identifies overfitting as comparison problem. As a practitioner, if we bear in mind that overfitting is about complexity ranking and it requires more than one model or inductive bias in order to be identified, then we are in better shape in selecting better model. Overfitting can not be detected via data alone on a single model.

Further reading

Some of the posts, reverse chronological order, that this blog have tried to convey what overfitting entails and its general implications.

Glossary

To make things clear, we provide concept definitions.

Generalisation A concept that if model can perform as good as the data it has not seen before, however seen here is a bit vague, it could have seen data points that are close to the data would be better suited in the context of supervised learning as oppose to compositional learning.

Goodness of fit An approach to check if model is generalised well.

Goodness of rank An approach to check if model is overfitted or under-fitted comparable to other models.

Holdout method A metod to build a model on the portion of available data and measure the goodness of fit on the holdout part of the data, i.e., test and train.

Inductive bias A set of assumptions data scientist made in building a representation of the real world, this manifest as model and the assumptions that come with a model.

Model A model is a biased view of the reality from data scientist. Usually appears as a function of observables $X$ and parameters $\Theta$, $f(X, \Theta)$. The different values of $\Theta$ do not constitute a different model. See also What is a statistical model?, Peter McCullagh

Occam's razor (Principle of parsimony) A principle that less complex explanation reflects reality better. Entities should not be multiplied beyond necessity.

Overgeneralisation (Under-generalisation) If we have a good performance on the training set but very bad performance on the test set, model said to overgeneralise or under-generalise; as a result of goodness of fit testing, i.e., comparing learning curves over test and train datasets.

Regularisation An approach to augment model to improve generalisation.

Postscript Notes

Note: Occam’s razor is a ranking problem: Generalisation is not

The holy grail of machine learning in practice is hold-out methods. We want to make sure that we don’t overgeneralise. However, a misconception has been propagated that overgeneralisation is mistakenly thought of as synonymous with overfitting. Overfitting has a different connotation as ranking different models rather than measuring the generalisation ability of a single model. The generalisation gap between training and test sets is not about Occam’s razor.

Scientific Memo

Tuesday, 20 December 2022

The concept of overgeneralisation and goodness of rank : Overfitting is not about comparing training and test learning curves

No comments:

Post a Comment

Mehmet Suzen

Related

(c) Copyright 2008-2024 Mehmet Suzen (suzen at acm dot org)