Preamble
Figure: Moon patterns the human brain invents. (Wikipedia)
Misconceptions: Poor generalisation is not synonymous with overfitting.
None of these techniques detects or prevents overfitting: cross-validation, having more data, early stopping, and comparing train-test learning curves are all about generalisation. Their purpose is not to detect overfitting.
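As a minimal illustration, assuming scikit-learn and a synthetic two-moons dataset (both illustrative assumptions, not part of this post's argument), cross-validation on a single model yields only a generalisation estimate; with one inductive bias in hand there is nothing to compare it against:

```python
from sklearn.datasets import make_moons
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic data and a single modelling choice (one inductive bias).
X, y = make_moons(n_samples=400, noise=0.3, random_state=0)
model = DecisionTreeClassifier(random_state=0)

# Cross-validation estimates how well this one bias generalises ...
scores = cross_val_score(model, X, y, cv=5)
print(scores.mean())

# ... but it says nothing about overfitting, because there is no second
# inductive bias to compare the score against.
```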
We need at least two different models, i.e., two different inductive biases, to judge which model is overfitted. One distinct approach in deep learning, dropout, mitigates overfitting by alternating between multiple models, i.e., multiple inductive biases. To actually judge overfitting, a dropout implementation would have to compare the test performances of those alternating models during training.
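A minimal NumPy sketch of this reading of dropout, with random weights standing in for a network mid-training (the weights, data, and dropout rate are all illustrative assumptions): each mask selects a different sub-network, and judging overfitting would mean comparing those sub-networks on held-out data.

```python
import numpy as np

rng = np.random.default_rng(1)

W1 = rng.normal(size=(8, 2))   # hidden layer weights (placeholder for a trained net)
W2 = rng.normal(size=(1, 8))   # output layer weights

x_test = rng.normal(size=(2, 50))         # held-out inputs (synthetic)
y_test = np.sin(x_test[0]) + x_test[1]    # held-out targets (synthetic)

def sub_model_test_mse(mask):
    """Evaluate the sub-network obtained by dropping hidden units where mask == 0."""
    h = np.maximum(W1 @ x_test, 0) * mask[:, None]   # ReLU hidden layer with dropout mask
    preds = (W2 @ h)[0]
    return float(np.mean((preds - y_test) ** 2))

# Two sampled masks = two alternating sub-models; comparing their held-out
# performance during training is the kind of judgement described above.
mask_a = (rng.random(8) > 0.5).astype(float)
mask_b = (rng.random(8) > 0.5).astype(float)
print(sub_model_test_mse(mask_a), sub_model_test_mse(mask_b))
```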
What is an inductive bias?
There are multiple conceptions of inductive bias. Here, we concentrate on a parametrised model $\mathscr{M}(\theta)$ on a dataset $\mathscr{D}$. The selection of a model type, or modelling approach, usually manifests as a functional form $\mathscr{M}=f(x)$ or as a function approximator, for example a neural network; these choices are all manifestations of an inductive bias. Different parametrisations of the same model learned on subsets of the dataset still share the same inductive bias.
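A minimal sketch of this distinction, assuming synthetic data and NumPy polynomial fits (both illustrative assumptions): two parametrisations of the same functional form remain one inductive bias, while a different functional form is a different inductive bias.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(-2, 2, 100)
y = 1.5 * x - 0.5 + rng.normal(0, 0.2, x.size)

# Two parametrisations of the *same* functional form f(x) = a*x + b, learned on
# different subsets of the data: still the same inductive bias.
theta_a = np.polyfit(x[:50], y[:50], 1)
theta_b = np.polyfit(x[50:], y[50:], 1)

# A different functional form (a degree-9 polynomial): a different inductive bias.
theta_c = np.polyfit(x, y, 9)

print(theta_a, theta_b)   # different theta, same M
print(len(theta_c))       # a different M altogether
```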
Complexity ranking of inductive biases: An algorithmic recipe
We sketch an algorithmic recipe for the complexity ranking of inductive biases via informal steps:
- Define a complexity measure $\mathscr{C}(\mathscr{M})$ over an inductive bias.
- Define a generalisation measure $\mathscr{G}(\mathscr{M}, \mathscr{D})$ over an inductive bias and a dataset.
- Select a set of inductive biases, at least two, e.g., $\mathscr{M}_{1}$ and $\mathscr{M}_{2}$.
- Compute the complexity and generalisation measures on $(\mathscr{M}, \mathscr{D})$, here for the two inductive biases: $\mathscr{C}_{1}$, $\mathscr{C}_{2}$, $\mathscr{G}_{1}$, $\mathscr{G}_{2}$.
- Rank $\mathscr{M}_{1}$ and $\mathscr{M}_{2}$: $\arg\max \{ \mathscr{G}_{1}, \mathscr{G}_{2}\}$ and $\arg\min \{ \mathscr{C}_{1}, \mathscr{C}_{2}\}$.
The core idea is that when the generalisation measures are close enough, we pick the inductive bias that is less complex; a minimal code sketch of the recipe is given below.
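The sketch below instantiates the recipe under concrete but assumed choices: polynomial degree as the inductive bias, the number of coefficients as $\mathscr{C}$, negative held-out mean squared error as $\mathscr{G}$, and a tolerance `eps` for "close enough". None of these specific choices come from the recipe itself.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic dataset D: a noisy sine curve (an illustrative assumption).
x = rng.uniform(-3, 3, 200)
y = np.sin(x) + rng.normal(0, 0.3, x.size)
x_train, y_train = x[:150], y[:150]
x_test, y_test = x[150:], y[150:]

def fit(degree):
    """One inductive bias M: polynomials of a fixed degree."""
    return np.polyfit(x_train, y_train, degree)

def complexity(coeffs):
    """C(M): number of free parameters of the fitted form."""
    return len(coeffs)

def generalisation(coeffs):
    """G(M, D): negative mean squared error on held-out data (higher is better)."""
    return -float(np.mean((np.polyval(coeffs, x_test) - y_test) ** 2))

def rank(degrees, eps=0.01):
    """Among biases whose generalisation is within `eps` of the best,
    return the least complex one (an informal Occam's razor)."""
    fitted = {d: fit(d) for d in degrees}
    gens = {d: generalisation(c) for d, c in fitted.items()}
    best = max(gens.values())
    candidates = [d for d in degrees if best - gens[d] <= eps]
    return min(candidates, key=lambda d: complexity(fitted[d]))

print(rank([3, 9]))  # M1: degree 3, M2: degree 9 -- typically prefers degree 3
```

If the generalisation scores differ by more than `eps`, the sketch simply keeps the better-generalising bias regardless of its complexity; the razor only cuts when the biases generalise comparably.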
Conclusion & Outlook
In practice, probably due to hectic delivery constraints or mere laziness, we still rely on the simple holdout method to build models: a single train-test split, not even learning curves, especially for deep learning models, without practicing Occam's razor. A major insight in this direction is that the holdout approach can only help us assess generalisation, not overfitting. We clarify this via the concept of inductive bias, noting that different parametrisations of the same model do not change the inductive bias introduced by the modelling choice.
In fact, given the resource constraints of the model life-cycle, i.e., the energy consumption and the cognitive load of introducing a complex model, practicing a proper Occam's razor, i.e., complexity ranking of inductive biases, is more important than ever for a sustainable environment and for human capital.
Further reading
Some of the posts, in reverse chronological order, in which this blog has tried to convey what overfitting entails and its general implications:
- Empirical risk minimization is not learning: A mathematical definition of learning and re-understanding of overfitting and Occam's razor in machine learning
- Critical look on why deployed machine learning model performance degrade quickly.
- Bringing back Occam's razor to modern connectionist machine learning.
- Understanding overfitting: an inaccurate meme in Machine Learning