Preamble
Figure: Occam (Wikipedia)
Learning Curve Setting: Generalisation Gap
Learning curves describe how a given algorithm's generalisation improves with time or experience; the idea originates from Ebbinghaus's work on human memory. We use the term inductive bias for a model, because a model can take many different forms, from differential equations to deep learning.
Definition: Given an inductive bias $\mathscr{M}$ trained on $n+1$ datasets $\mathbb{T} = \{ \mathbb{T}_{0}, \mathbb{T}_{1}, ..., \mathbb{T}_{n} \}$ with monotonically increasing sizes, $|\mathbb{T}_{0}| < |\mathbb{T}_{1}| < ... < |\mathbb{T}_{n}|$, a learning curve $\mathscr{L}$ for $\mathscr{M}$ is expressed by the performance measure of the model over these datasets, $\mathbb{p} = \{ p_{0}, p_{1}, ..., p_{n} \}$. Hence $\mathscr{L}$ is a curve on the plane of $(|\mathbb{T}|, p)$.
By this definition, we deduce that $\mathscr{M}$ learns if $\mathscr{L}$ increases monotonically.
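The definition above can be illustrated with a minimal sketch. Everything concrete here is an assumption made for illustration, not part of the definition: the noisy sine data, the cubic polynomial standing in for the inductive bias $\mathscr{M}$, $R^2$ on held-out data as the performance measure $p$ (higher is better), and the helper names `learning_curve` and `learns`.

```python
import numpy as np

def r2_score(y_true, y_pred):
    """Coefficient of determination; higher is better, 1.0 is a perfect fit."""
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - np.mean(y_true)) ** 2)
    return 1.0 - ss_res / ss_tot

def learning_curve(sizes, degree=3, noise=0.1, seed=0):
    """Performance p_i on fixed held-out data for each training-set size |T_i|."""
    rng = np.random.default_rng(seed)

    # Hypothetical data-generating process: a noisy sine wave on [-1, 1].
    def sample(m):
        x = rng.uniform(-1.0, 1.0, m)
        return x, np.sin(np.pi * x) + rng.normal(0.0, noise, m)

    x_pool, y_pool = sample(max(sizes))  # nested datasets: T_i is a prefix of T_{i+1}
    x_test, y_test = sample(500)
    curve = []
    for m in sizes:
        # Same inductive bias (polynomial of fixed degree) fitted on each T_i.
        coeffs = np.polyfit(x_pool[:m], y_pool[:m], degree)
        curve.append(r2_score(y_test, np.polyval(coeffs, x_test)))
    return np.array(curve)

def learns(curve):
    """True if the curve never decreases along increasing dataset size."""
    return bool(np.all(np.diff(curve) >= 0))

sizes = [10, 20, 40, 80, 160, 320]
curve = learning_curve(sizes)
for m, p in zip(sizes, curve):
    print(f"|T| = {m:4d}  p = {p:.3f}")
print("monotone:", learns(curve))
```

With noisy data the monotonicity check can fail on a single run even when the overall trend is upward, so in practice one would likely average the curve over several seeds before applying it.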
A generalisation gap is defined as follows.
Definition: The generalisation gap for an inductive bias $\mathscr{M}$ is the difference between its learning curve on unseen datasets, $\mathscr{L}$, and its learning curve on the datasets used for training, $\mathscr{L}^{train}$. The difference can be a simple difference, or another measure quantifying the gap.
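As a rough sketch of this definition, the gap can be computed as a pointwise difference between two learning curves of the same inductive bias. The sign convention (training minus unseen), the $R^2$ performance measure, the degree-5 polynomial, the synthetic sine data and the function names are all assumptions chosen for illustration.

```python
import numpy as np

def r2_score(y_true, y_pred):
    """Coefficient of determination; higher is better."""
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - np.mean(y_true)) ** 2)
    return 1.0 - ss_res / ss_tot

def generalisation_gap(train_curve, unseen_curve):
    """Simple pointwise difference L^train - L (assumed sign convention)."""
    return np.asarray(train_curve) - np.asarray(unseen_curve)

def train_and_unseen_curves(sizes, degree=5, noise=0.1, seed=0):
    """Learning curves of one inductive bias on training and on unseen data."""
    rng = np.random.default_rng(seed)

    # Hypothetical data-generating process: a noisy sine wave on [-1, 1].
    def sample(m):
        x = rng.uniform(-1.0, 1.0, m)
        return x, np.sin(np.pi * x) + rng.normal(0.0, noise, m)

    x_pool, y_pool = sample(max(sizes))
    x_new, y_new = sample(500)
    train, unseen = [], []
    for m in sizes:
        x, y = x_pool[:m], y_pool[:m]
        coeffs = np.polyfit(x, y, degree)
        train.append(r2_score(y, np.polyval(coeffs, x)))            # seen data
        unseen.append(r2_score(y_new, np.polyval(coeffs, x_new)))   # unseen data
    return np.array(train), np.array(unseen)

sizes = [10, 20, 40, 80, 160, 320]
train_curve, unseen_curve = train_and_unseen_curves(sizes)
gap = generalisation_gap(train_curve, unseen_curve)
for m, g in zip(sizes, gap):
    print(f"|T| = {m:4d}  gap = {g:+.3f}")
```

Note that only one inductive bias appears in this computation; no second model of different complexity is involved.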
We conjecture the following.
Conjecture: The generalisation gap cannot identify whether $\mathscr{M}$ is an overfitted model. Overfitting is about Occam's razor, and requires a pairwise comparison between two inductive biases of different complexities.
As the conjecture suggests, the generalisation gap is not about overfitting, despite the common misconception. Why, then, does the misconception persist? It stems from confusion about how to produce the curve on which overfitting can be judged.
Occam Curves: Overfitting Gap [Occam's Gap]
Further reading & notes
- Further posts and a glossary: The concept of overgeneralisation and goodness of rank.
- Double descent phenomenon: it uses Occam's curves, not learning curves.
- We use dataset size as an interpretation of increasing experience. There could be other ways of expressing gained experience, but we take the most obvious evidence.
Model selection and evaluation are often confused by novices as well as by experienced data scientists and professionals doing modelling. There are many misconceptions in the literature, but in practice the primary take-home messages can be summarised as follows:
1. What is a model? A model is an “inductive bias” of the modeller: a selected set of parametrised functions, for example a neural network architecture choice. Contrary to what many assume, a specific parametrisation of a model (deep learning architecture) is not a different model.
2. The difference between a model’s test and training performance is the generalisation gap. Overfitting and under-fitting are not about the generalisation gap.
3. Overfitting or under-fitting is a comparison problem: how does a model deviate from a reference model? This is called Occam’s gap, or the so-called model selection error.
4. Occam’s gap generalises empirical risk minimisation over a learning curve. Empirical risk minimisation itself is not about learning.
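Points 3 and 4 can be sketched as a pairwise comparison along a learning curve. The text does not fix a formula for Occam’s gap, so the one used here (reference performance minus candidate performance on held-out data, at each dataset size) is only an assumed reading. The polynomial degrees, the synthetic sine data and the function names are likewise hypothetical.

```python
import numpy as np

def r2_score(y_true, y_pred):
    """Coefficient of determination; higher is better."""
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - np.mean(y_true)) ** 2)
    return 1.0 - ss_res / ss_tot

def held_out_curve(degree, sizes, x_pool, y_pool, x_new, y_new):
    """Learning curve of one inductive bias (polynomial degree) on held-out data."""
    curve = []
    for m in sizes:
        coeffs = np.polyfit(x_pool[:m], y_pool[:m], degree)
        curve.append(r2_score(y_new, np.polyval(coeffs, x_new)))
    return np.array(curve)

def occam_gap(reference_curve, candidate_curve):
    """Assumed form: pointwise difference between two inductive biases.

    Positive values mean the candidate performs worse than the reference.
    """
    return np.asarray(reference_curve) - np.asarray(candidate_curve)

rng = np.random.default_rng(0)
sizes = [20, 40, 80, 160, 320]

# Hypothetical data-generating process: a noisy sine wave on [-1, 1].
x_pool = rng.uniform(-1.0, 1.0, max(sizes))
y_pool = np.sin(np.pi * x_pool) + rng.normal(0.0, 0.1, x_pool.size)
x_new = rng.uniform(-1.0, 1.0, 500)
y_new = np.sin(np.pi * x_new) + rng.normal(0.0, 0.1, x_new.size)

# Two inductive biases of different complexity, compared on the same data.
reference = held_out_curve(3, sizes, x_pool, y_pool, x_new, y_new)
candidate = held_out_curve(9, sizes, x_pool, y_pool, x_new, y_new)
gap = occam_gap(reference, candidate)
for m, g in zip(sizes, gap):
    print(f"|T| = {m:4d}  Occam gap = {g:+.3f}")
```

Unlike the generalisation gap, this quantity cannot be computed from a single model: it needs a second inductive bias of different complexity as a reference.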
How and when a model generalises well, and the generalisation of empirical risk minimisation, are currently open research topics.