Tuesday 20 December 2022

The concept of overgeneralisation and goodness of rank : Overfitting is not about comparing training and test learning curves

Preamble 

Walt Disney Concert Hall, Los Angeles (Wikipedia)


Unfortunately, it is still taught in machine learning classes that overfitting can be detected by comparing training and test learning curves of a single model's performance. The origins of this misconception are unknown; it looks like an urban legend that has diffused into mainstream practice, and even academic works take the misconception for granted. Overfitting is inherently about comparing the complexities of two (or more) models. Models manifest themselves as the inductive biases a modeller or data scientist brings to their task. This makes overfitting, in reality, a Bayesian concept at its core. It is not about comparing training and test learning curves to check whether a model follows noise, but a pairwise model comparison-testing procedure to select the most plausible belief among our beliefs, the one carrying the least information: entities should not be multiplied beyond necessity, i.e., Occam's razor. We introduce a new concept to clarify this practically, goodness of rank, to distinguish it from the well-known goodness of fit, and we clarify concepts and provide steps to label models as overfitted or underfitted.

Poorly generalised model : Overgeneralisation or under-generalisation

The practice described in machine learning classes, and followed in industry, is that overfitting means your model follows the training set closely but fails to generalise to the test set. Such a model is not an overfitted model but a model that fails to generalise: a phenomenon that should be called overgeneralisation (or under-generalisation).

A procedure to detect an overfitted model : Goodness of rank

We have previously provided a complexity-based abstract description of the model selection procedure, here called complexity ranking; we repeat this procedure with explicit identification of the overfitted model.

The following is a sketch of an algorithmic recipe for complexity ranking of inductive biases, with overfitted model identification made explicit (a code sketch follows the list):

  1. Define a complexity measure $\mathscr{C}(\mathscr{M})$ over an inductive bias.
  2. Define a generalisation measure $\mathscr{G}(\mathscr{M}, \mathscr{D})$ over an inductive bias and a dataset.
  3. Select a set of inductive biases, at least two: $\mathscr{M}_{1}$ and $\mathscr{M}_{2}$.
  4. Produce complexity and generalisation measures on ($\mathscr{M}$, $\mathscr{D}$); here, for two inductive biases: $\mathscr{C}_{1}$, $\mathscr{C}_{2}$, $\mathscr{G}_{1}$, $\mathscr{G}_{2}$.
  5. Rank $\mathscr{M}_{1}$ and $\mathscr{M}_{2}$ by $\arg\max \{ \mathscr{G}_{1}, \mathscr{G}_{2}\}$ and $\arg\min \{ \mathscr{C}_{1}, \mathscr{C}_{2}\}$.
  6. $\mathscr{M}_{1}$ is an overfitted model compared to $\mathscr{M}_{2}$ if $\mathscr{G}_{1} \le \mathscr{G}_{2}$ and $\mathscr{C}_{1} > \mathscr{C}_{2}$.
  7. $\mathscr{M}_{2}$ is an overfitted model compared to $\mathscr{M}_{1}$ if $\mathscr{G}_{2} \le \mathscr{G}_{1}$ and $\mathscr{C}_{2} > \mathscr{C}_{1}$.
  8. $\mathscr{M}_{1}$ is an underfitted model compared to $\mathscr{M}_{2}$ if $\mathscr{G}_{1} < \mathscr{G}_{2}$ and $\mathscr{C}_{1} < \mathscr{C}_{2}$.
  9. $\mathscr{M}_{2}$ is an underfitted model compared to $\mathscr{M}_{1}$ if $\mathscr{G}_{2} < \mathscr{G}_{1}$ and $\mathscr{C}_{2} < \mathscr{C}_{1}$.
If two models have the same complexity, then the better-generalising model should be selected; in this case we cannot conclude that either model is overfitted, only that they generalise differently. Remember that overfitting is about complexity ranking: goodness of rank.
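A minimal sketch of this recipe in Python, assuming complexity is measured by something like a parameter count (lower is better) and generalisation by a held-out score (higher is better); the measures and the verdict strings are illustrative assumptions, not prescriptions:

    def goodness_of_rank(c1, g1, c2, g2):
        """Pairwise verdict for models M1 and M2, given complexity measures
        (c1, c2) and generalisation measures (g1, g2), following steps 5-9."""
        if c1 == c2:
            # Equal complexity: no overfitting verdict, only a generalisation ranking.
            return "select M1" if g1 > g2 else "select M2"
        if c1 > c2:
            # M1 carries the more complex inductive bias.
            return "M1 overfitted compared to M2" if g1 <= g2 else "M2 underfitted compared to M1"
        # M2 carries the more complex inductive bias.
        return "M2 overfitted compared to M1" if g2 <= g1 else "M1 underfitted compared to M2"

    # Example: a 10-parameter model that generalises no better than a 3-parameter one.
    print(goodness_of_rank(c1=10, g1=0.81, c2=3, g2=0.82))  # M1 overfitted compared to M2

Note that every verdict names a pair: no branch attributes overfitting to a model in isolation.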

But overgeneralisation sounds like overfitting, doesn't it?

Operationally, overgeneralisation and overfitting imply two different things. Overgeneralisation can be detected with a single model, because we can measure the generalisation performance of the model alone against data; in the statistical literature this is called goodness of fit. Moreover, overgeneralisation can also be called under-generalisation, as both imply poor generalisation performance.

However, overfitting implies that a model is overly complex compared to another model, i.e., the model overfits, but compared to what? Practically speaking, overgeneralisation can be detected via the holdout method, but overfitting cannot. Overfitting goes beyond goodness of fit to goodness of rank, as in the pairwise model comparison recipe provided above.
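To make the operational contrast concrete, here is a minimal sketch of holdout-based goodness of fit on a single model; the synthetic data, the 80/20 split, and the scikit-learn tree model are illustrative assumptions:

    import numpy as np
    from sklearn.model_selection import train_test_split
    from sklearn.tree import DecisionTreeRegressor

    rng = np.random.default_rng(42)
    X = rng.uniform(-1.0, 1.0, size=(200, 1))
    y = np.sin(3.0 * X[:, 0]) + rng.normal(scale=0.3, size=200)  # noisy target

    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
    model = DecisionTreeRegressor(random_state=0).fit(X_train, y_train)  # unconstrained depth
    print(model.score(X_train, y_train))  # ~1.0: follows the training set closely
    print(model.score(X_test, y_test))    # noticeably lower: poor generalisation

The gap alone tells us that this single model generalises poorly (overgeneralisation); calling it overfitted would require a second, less complex model to rank against.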

Conclusion

The practice of comparing training and test learning curves to detect overfitting has diffused so deeply into machine learning that the concept is almost always taught in a fuzzy way, even in distinguished lectures. Older textbooks and papers correctly identify overfitting as a comparison problem. As practitioners, if we bear in mind that overfitting is about complexity ranking and requires more than one model or inductive bias to be identified, then we are in better shape to select a better model. Overfitting cannot be detected from data alone on a single model.


To make things clear, we provide concept definitions.

Generalisation A concept describing whether a model can perform as well on data it has not seen before. However, 'seen' here is a bit vague; 'it could have seen data points that are close to the new data' would be better suited in the context of supervised learning, as opposed to compositional learning.

Goodness of fit An approach to check whether a model generalises well.

Goodness of rank An approach to check whether a model is overfitted or underfitted compared to other models.

Holdout method A method that builds a model on a portion of the available data and measures goodness of fit on the held-out portion, i.e., train and test splits.

Inductive bias A set of assumptions a data scientist makes in building a representation of the real world; this manifests as a model and the assumptions that come with it.

Model A model is a biased view of reality held by the data scientist. It usually appears as a function of observables $X$ and parameters $\Theta$, $f(X, \Theta)$. Different values of $\Theta$ do not constitute different models. See also What is a statistical model?, Peter McCullagh.

Occam's razor (Principle of parsimony) A principle that the less complex explanation reflects reality better: entities should not be multiplied beyond necessity.

Overgeneralisation (Under-generalisation) If a model performs well on the training set but poorly on the test set, it is said to overgeneralise or under-generalise; this is detected by goodness-of-fit testing, i.e., comparing learning curves over the train and test datasets.

Regularisation An approach to augmenting a model to improve generalisation.

Postscript Notes

Note: Occam’s razor is a ranking problem: Generalisation is not 

The holy grail of machine learning in practice is the holdout method. We want to make sure that we don't overgeneralise. However, a misconception has propagated whereby overgeneralisation is mistakenly thought of as synonymous with overfitting. Overfitting has a different connotation: ranking different models, rather than measuring the generalisation ability of a single model. The generalisation gap between training and test sets is not about Occam's razor.

Monday 5 December 2022

The conditional query fallacy: Applying Bayesian inference from a discrete mathematics perspective

Preamble

The Tilled Field, Joan Miró (Wikipedia)
One of the core concepts in data science is the conditional probability, $p(x|y)$: it appears as a logical description of many tasks, such as formulating regression, and as a core concept in Bayesian inference. However, operationally there is no special meaning to conditional or joint probabilities, as their arguments are no more than compositional event statements. This raises a question: is there any fundamental relationship between Bayesian inference and discrete mathematics that is practically relevant to us as practitioners? After all, both topics are based on discrete statements returning Boolean values. Unfortunately, the answer to this question is a rabbit hole, and probably even open research. There are no clearly established connections between the fundamentals of discrete mathematics and Bayesian inference.

Statement mappings as definition of probability 

A statement is a logical description of an event, or a set of events. Let's give a semi-formal description of such statements.

Definition: A mathematical or logical statement is formed with Boolean relationships $\mathscr{R}$ (conjunctions) among a set of events $\mathscr{E}$, so a statement $\mathbb{S}$ is formed with at least one tuple $\langle \mathscr{R}, \mathscr{E} \rangle$.

Relationships can be any binary operator, and events can describe anything perceptible, i.e., a discretised existence. This is the core of discrete mathematics, and almost all problems in this domain are formed in this setting, from defining functions to graph theory. Probability is no exception, and its definition naturally follows, as a so-called statement mapping.

Definition: A probability $\mathbb{P}$ is a statement mapping, $\mathbb{P}: \mathbb{S} \rightarrow [0,1]$. 

The interpretation of this definition is that a logical statement is always True if its probability is 1 and always False if it is 0. However, building conditionals on this basis is not so clear cut.
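As a toy illustration of the two definitions above, statements can be represented as Boolean predicates over events, and a probability as their empirical frequency over a finite sample space; the weather events here are an invented example:

    # Toy sample space: each event is a discretised observation.
    events = [
        {"rain": True,  "wind": False},
        {"rain": True,  "wind": True},
        {"rain": False, "wind": True},
        {"rain": False, "wind": False},
    ]

    def P(statement):
        """Statement mapping: send a predicate over events to [0, 1]."""
        return sum(1 for e in events if statement(e)) / len(events)

    print(P(lambda e: e["rain"]))                # 0.5
    print(P(lambda e: e["rain"] and e["wind"]))  # 0.25: a conjunction of events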

Conditional Query Fallacy 

Non-commutativity of statements implies that reversing the order of statements should not yield the same filtered dataset for Bayesian inference. However, Bayes' theorem would run into a fallacy for statement mappings of conditionals in this sense.

Definition: The conditional query fallacy arises when one cannot update a belief in probability, because reversing the order of statements in conditional probabilities halts the Bayesian update, i.e., back-to-back queries result in the same dataset for inference.

At first glance, this looks as if Bayes' rule does not support a commutative property, with the posterior in practice being equal to the likelihood. However, this fallacy turns out to be a notational misdirection. Back-to-back inference on the filtered dataset constitutes the conditional fallacy, i.e., when a query language is used to filter data, obtaining A|B and B|A yields the same dataset regardless of filtering order.
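A minimal sketch of the fallacy on an invented toy dataset: the two filtering orders return identical rows, while the two conditional probabilities differ because they are normalised by different filtered sample sizes:

    records = [
        {"A": True,  "B": True},
        {"A": True,  "B": False},
        {"A": True,  "B": False},
        {"A": True,  "B": False},
        {"A": False, "B": True},
        {"A": False, "B": False},
    ]

    a_rows = [r for r in records if r["A"]]
    b_rows = [r for r in records if r["B"]]
    a_then_b = [r for r in a_rows if r["B"]]  # filter by A, then B
    b_then_a = [r for r in b_rows if r["A"]]  # filter by B, then A
    assert a_then_b == b_then_a  # query order does not matter: same dataset

    p_a_given_b = len(b_then_a) / len(b_rows)  # 0.5: normalised by |B|
    p_b_given_a = len(a_then_b) / len(a_rows)  # 0.25: normalised by |A|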

However, in inference with data, the likelihood is, strictly speaking, not a conditional probability and not a filtering operation. It is merely a measure for the update rule. We compute the likelihood by multiplying values obtained by inserting i.i.d. samples into a distribution, such as the likelihood model paired with a conjugate prior. Hence, computationally the likelihood is not really the reversal of a conditional, as in $P(A|B)$ written in reverse, $P(B|A)$.
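For contrast, a minimal sketch of a likelihood computation under an assumed Bernoulli model: the likelihood is a product of model values at the observed i.i.d. samples, not a ratio of filtered dataset sizes:

    from math import prod

    def bernoulli_likelihood(theta, samples):
        """Product of p(x | theta) over i.i.d. Boolean samples."""
        return prod(theta if x else 1.0 - theta for x in samples)

    samples = [True, True, False, True]
    print(bernoulli_likelihood(0.7, samples))  # 0.7 * 0.7 * 0.3 * 0.7 = 0.1029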

Outlook

In computing conditional probabilities for Bayesian inference, our primary assumption is that the two conditional probabilities, the likelihood and the posterior, are not identical. Discrete mathematics only allows Bayesian updates if time evolution is explicitly stated with non-commutative statements for conditionals.

Going back to our initial question: there is indeed a deep connection between the fundamentals of discrete mathematics and Bayesian belief updates on events as logical statements. The fallacy sounds like a trivial error in judgement, but (un)fortunately it leads into the philosophical definitions of probability: simultaneous tracking of time and sample space is not explicitly encoded in any of the notations, making the statement-mapping definition of probability a bit shaky.

Glossary of concepts

Statement Mapping A given set of mathematical statements mapped into a domain of numbers.

Probability A statement mapping whose target domain is $\mathscr{D} = [0,1]$.

Conditional query fallacy Put differently from the definition above: thinking that two conditional probabilities, as reversed statements of each other in Bayesian inference, yield the same dataset regardless of the time-ordering of the queries.

Notes and further reading

  • The fallacy is computing $P(A|B)=P(B|A)$ because filtering results in identical datasets. The correction is to use a different sample size for the reversed statement, or to compute joints and marginals separately on their own filtered datasets: use the first filter's sample size in computing the probability, not the total.
  • The discrete mathematics we refer to here appears within the arguments of probabilities. The discussion of discrete parameter estimation is a different topic; Gelman discusses this here.
  • Conjunction Fallacy
  • Probability Interpretations
  • Bayesian rabbit holes: Decoding conditional probability with non-commutative algebra M. Süzen (2022)
  • Holes in Bayesian Statistics Gelman-Yao (2021) : This is a beautifully written article, especially the proposal that context dependence should be used instead of subjectivity.
(c) Copyright 2008-2024 Mehmet Suzen (suzen at acm dot org)

This work is licensed under a Creative Commons Attribution 4.0 International License.