Tuesday, 20 December 2022

The concept of overgeneralisation and goodness of rank: Overfitting is not about comparing training and test learning curves

Preamble 

    Walt Disney Hall,
Los Angeles  (Wikipedia)


Unfortunately, it is still taught in machine learning classes that overfitting can be detected by comparing training and test learning curves of a single model's performance. The origins of this misconception are unknown; it looks like an urban legend that has diffused into mainstream practice, and even academic works take it for granted. Overfitting's definition is inherently about comparing the complexities of two (or more) models. Models manifest themselves as the inductive biases a modeller or data scientist brings to their task. This makes overfitting, at its core, a Bayesian concept. It is not about inspecting training and test learning curves to see whether a model follows noise, but a pairwise model comparison-testing procedure to select the most plausible belief among our beliefs that carries the least information: entities should not be multiplied beyond necessity, i.e., Occam's razor. We introduce a new concept to clarify this practically, goodness of rank, to distinguish it from the well-known goodness of fit, and we clarify concepts and provide steps to label models as overfitted or underfitted.

Poorly generalised model: Overgeneralisation or under-generalisation

The practice described in machine learning classes, and followed in industry, is that overfitting is about your model following the training set closely while failing to generalise to the test set. This is not an overfitted model but a model that fails to generalise: a phenomenon that should be called overgeneralisation (or under-generalisation).

A procedure to detect an overfitted model: Goodness of rank

We have previously provided a complexity-based abstract description of the model selection procedure as complexity ranking; here we repeat this procedure with explicit identification of the overfitted model.

The following steps sketch an algorithmic recipe for complexity ranking of inductive biases, with explicit identification of the overfitted model; a minimal code sketch follows the list.

  1. Define a complexity measure $\mathscr{C}$($\mathscr{M}$) over an inductive bias.
  2. Define a generalisation measure $\mathscr{G}$($\mathscr{M}$, $\mathscr{D}$) over an inductive bias and dataset.
  3. Select a set of inductive biases, at least-two, $\mathscr{M}_{1}$ and $\mathscr{M}_{2}$.
  4. Produce complexity and generalisation measures on ($\mathscr{M}$, $\mathscr{D}$): Here for two inductive biases: $\mathscr{C}_{1}$, $\mathscr{C}_{2}$,   $\mathscr{G}_{1}$, $\mathscr{G}_{2}$.
  5. Ranking of  $\mathscr{M}_{1}$ and $\mathscr{M}_{2}$:  $argmax \{ \mathscr{G}_{1}, \mathscr{G}_{2}\}$ and $argmin \{ \mathscr{C}_{1}, \mathscr{C}_{2}\}$ 
  6. $\mathscr{M}_{1}$ is an overfitted model compared to $\mathscr{M}_{2}$ if $\mathscr{G}_{1} \le \mathscr{G}_{2}$ and $\mathscr{C}_{1} > \mathscr{C}_{2}$.
  7. $\mathscr{M}_{2}$ is an overfitted model compared to $\mathscr{M}_{1}$ if $\mathscr{G}_{2} \le \mathscr{G}_{1}$ and $\mathscr{C}_{2} > \mathscr{C}_{1}$.
  8. $\mathscr{M}_{1}$ is an underfitted model compared to $\mathscr{M}_{2}$ if $\mathscr{G}_{1} < \mathscr{G}_{2}$ and $\mathscr{C}_{1} < \mathscr{C}_{2}$.
  9. $\mathscr{M}_{2}$ is an underfitted model compared to $\mathscr{M}_{1}$ if $\mathscr{G}_{2} < \mathscr{G}_{1}$ and $\mathscr{C}_{2} < \mathscr{C}_{1}$.
If two models have the same complexity, then the better-generalised model should be selected; in this case we cannot conclude that either model is overfitted, only that they generalise differently. Remember that overfitting is about complexity ranking: goodness of rank.
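
A minimal sketch of steps 6-9 as a plain Python function follows; the function name and the example numbers are hypothetical, and the complexity and generalisation measures $\mathscr{C}_{i}$, $\mathscr{G}_{i}$ are assumed to be already computed for both inductive biases.

  def goodness_of_rank(C1, C2, G1, G2):
      """Pairwise verdict on two inductive biases, following steps 6-9.

      C1, C2: complexity measures; G1, G2: generalisation measures
      (higher G means better generalisation)."""
      if G1 <= G2 and C1 > C2:
          return "M1 overfitted relative to M2"
      if G2 <= G1 and C2 > C1:
          return "M2 overfitted relative to M1"
      if G1 < G2 and C1 < C2:
          return "M1 underfitted relative to M2"
      if G2 < G1 and C2 < C1:
          return "M2 underfitted relative to M1"
      return "equal complexity: pick the better-generalising model"

  # Example: a more complex model (C=10) generalising no better (G=0.81)
  # than a simpler one (C=3, G=0.83) is flagged as overfitted.
  print(goodness_of_rank(C1=10, C2=3, G1=0.81, G2=0.83))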

But overgeneralisation sounds like overfitting, doesn't it?

Operationally, overgeneralisation and overfitting imply two different things. Overgeneralisation can be detected with a single model, because we can measure the generalisation performance of the model alone with data; in the statistical literature this is called goodness of fit. Moreover, overgeneralisation can also be called under-generalisation, as both imply poor generalisation performance.

However, overfitting implies that a model overly performs compared to another model, i.e., the model overfits, but compared to what? Practically speaking, overgeneralisation can be detected via the holdout method, but overfitting cannot. Overfitting goes beyond goodness of fit to goodness of rank, via the recipe we provided as a pairwise model comparison.

Conclusion

The practice of comparing training and test learning curves for overfitting has diffused into machine learning so deeply that the concept is almost always taught in a fuzzy way, even in distinguished lectures. Older textbooks and papers correctly identify overfitting as a comparison problem. As practitioners, if we bear in mind that overfitting is about complexity ranking and that it requires more than one model or inductive bias to be identified, then we are in better shape to select the better model. Overfitting cannot be detected via data alone on a single model.


To make things clear, we provide concept definitions.

Generalisation A concept asking whether a model can perform as well on data it has not seen before; however, seen here is a bit vague. The model could have seen data points that are close to the new data, which is better suited to the context of supervised learning as opposed to compositional learning.

Goodness of fit An approach to check if a model generalises well.

Goodness of rank An approach to check if a model is overfitted or underfitted compared to other models.

Holdout method A method to build a model on a portion of the available data and measure the goodness of fit on the held-out part of the data, i.e., test and train splits.

Inductive bias A set of assumptions a data scientist makes in building a representation of the real world; this manifests as the model and the assumptions that come with it.

Model A model is a data scientist's biased view of reality. It usually appears as a function of observables $X$ and parameters $\Theta$, $f(X, \Theta)$. Different values of $\Theta$ do not constitute different models. See also What is a statistical model?, Peter McCullagh.

Occam's razor (Principle of parsimony) A principle that the less complex explanation reflects reality better. Entities should not be multiplied beyond necessity.

Overgeneralisation (Under-generalisation) If we have good performance on the training set but very bad performance on the test set, the model is said to overgeneralise or under-generalise, as a result of goodness-of-fit testing, i.e., comparing learning curves over test and train datasets.

Regularisation An approach to augment a model to improve generalisation.

Postscript Notes

Note: Occam’s razor is a ranking problem: Generalisation is not 

The holy grail of machine learning in practice is the holdout method. We want to make sure that we don't overgeneralise. However, a misconception has been propagated whereby overgeneralisation is mistakenly thought of as synonymous with overfitting. Overfitting has a different connotation: ranking different models, rather than measuring the generalisation ability of a single model. The generalisation gap between training and test sets is not about Occam's razor.

Monday, 5 December 2022

The conditional query fallacy: Applying Bayesian inference from discrete mathematics perspective

Preamble

    The Tilled Field,
Joan Miró
(Wikipedia)
One of the core concepts in the data sciences is conditional probability: $p(x|y)$ appears as a logical description of many tasks, such as formulating regression, and as a core concept in Bayesian inference. However, there is operationally no special meaning to conditional or joint probabilities, as their arguments are no more than compositional event statements. This raises a question: is there any fundamental relationship between Bayesian inference and discrete mathematics that is practically relevant to us as practitioners? After all, both topics are based on discrete statements returning Boolean variables. Unfortunately, the answer to this question is a rabbit hole and probably even open research. There are no clearly established connections between the fundamentals of discrete mathematics and Bayesian inference.

Statement mappings as definition of probability 

A statement is a logical description of some event, or set of events. Let's have a semi-formal description of such statements.

Definition: A mathematical or logical statement is formed with Boolean relationships $\mathscr{R}$ (conjunctions) among a set of events $\mathscr{E}$, so a statement $\mathbb{S}$ is formed with at least a tuple $\langle \mathscr{R}, \mathscr{E} \rangle$.

Relationships can be any binary operator, and events could describe anything perceptional, i.e., a discretised existence. This is the core of discrete mathematics, and almost all problems in this domain are formed in this setting, from defining functions to graph theory. Probability is no exception, and its definition naturally follows as a so-called statement mapping.

Definition: A probability $\mathbb{P}$ is a statement mapping, $\mathbb{P}: \mathbb{S} \rightarrow [0,1]$. 

The interpretation of this definition is that a logical statement is always True if its probability is 1 and always False if it is 0. However, building conditionals on top of this is not so clear cut.

Conditional Query Fallacy 

Non-commutativity of statements implies that reversing the order of statements should not yield the same filtered set on the data for Bayesian inference. However, Bayes' theorem appears to suffer a fallacy for statement mappings of conditionals in this sense.

Definition: The conditional query fallacy is the inability to update a belief in probability, because reversing the order of statements in conditional probabilities halts the Bayes update, i.e., back-to-back queries result in the same dataset for inference.

At first glance, this appears as if Bayes' rule does not support the commutative property, practically the posterior being equal to the likelihood. However, this fallacy turns out to be a notational misdirection. Inference on a back-to-back filtered dataset constitutes the conditional fallacy, i.e., when a query language is used to filter data, getting A|B and B|A yields the same dataset regardless of filtering order.
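
A minimal sketch of the fallacy with pandas; the toy columns A and B are hypothetical Boolean event indicators. Filtering in either order lands on exactly the same rows, while the two conditional probabilities differ only through their denominators.

  import pandas as pd

  # Hypothetical toy data: two Boolean event columns A and B.
  df = pd.DataFrame({"A": [1, 1, 0, 0, 1, 0],
                     "B": [1, 0, 1, 0, 1, 1]})

  # Back-to-back filtering in either order yields the same rows:
  ab = df.query("B == 1").query("A == 1")   # "A given B" as a query
  ba = df.query("A == 1").query("B == 1")   # "B given A" as a query
  print(ab.equals(ba))                      # True: the fallacy in action

  # The probabilities differ only through their denominators:
  p_a_given_b = len(ab) / len(df.query("B == 1"))   # 2/4
  p_b_given_a = len(ba) / len(df.query("A == 1"))   # 2/3
  print(p_a_given_b, p_b_given_a)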

However, in inference with data, the likelihood is strictly speaking not a conditional probability, and not a filtering operation. It is merely a measure used in the update rule. We compute the likelihood by multiplying the values obtained by inserting i.i.d. samples into a distribution, for example alongside a conjugate prior. Hence, computationally, the likelihood is not really a reversal of a conditional, as in $P(A|B)$ written reversed as $P(B|A)$.
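
To make this concrete, here is a minimal sketch with an assumed Beta-Bernoulli conjugate setup (the data values and the candidate parameter are hypothetical): the likelihood is a product of density values of i.i.d. samples, not a row filter, and the conjugate update touches only sufficient statistics.

  # i.i.d. Bernoulli samples and a candidate parameter (both hypothetical)
  data = [1, 0, 1, 1, 0, 1]
  theta = 0.6

  # Likelihood: a product of pmf values, not a filtering operation.
  likelihood = 1.0
  for x in data:
      likelihood *= theta**x * (1 - theta)**(1 - x)   # Bernoulli pmf

  # Conjugate Beta(1, 1) prior updates via sufficient statistics:
  a_post = 1 + sum(data)               # prior a + number of ones
  b_post = 1 + len(data) - sum(data)   # prior b + number of zeros
  print(likelihood, (a_post, b_post))  # posterior is Beta(5, 3)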

Outlook

In computing conditional probabilities for Bayesian inference, our primary assumption is that the conditional probabilities, the likelihood and the posterior, are not identical. Discrete mathematics only allows Bayesian updates if time evolution is explicitly stated via non-commutative statements for conditionals.

Going back to our initial question: indeed, there is a deep connection between the fundamentals of discrete mathematics and Bayesian belief updates on events as logical statements. The fallacy sounds like a trivial error in judgement, but (un)fortunately it goes into the philosophical definitions of probability: simultaneous tracking of time and sample space is not encoded explicitly in any of the notations, making the statement-filtering definition of probability a bit shaky.

Glossary of concepts

Statement Mapping A given set of mathematical statements mapped into a domain of numbers.

Probability A statement mapping whose codomain is $\mathscr{D} = [0,1]$.

Conditional query fallacy Put differently than the definition above: thinking that two conditional probabilities, as reversed statements of each other in Bayesian inference, yield the same dataset regardless of the time-ordering of the queries.

Notes and further reading

Tuesday, 15 November 2022

Differentiating ensembles and sample spaces: Alignment between statistical mechanics and probability theory

Preamble 

The sample space is the first concept introduced in probability and statistics books and papers. However, there needs to be more clarity about what constitutes a sample space in general: there is no explicit distinction between the unique event set and the replica sets. The resolution of this ambiguity lies in the concept of an ensemble. The concept was first introduced by the American theoretical physicist and engineer Gibbs in his book Elementary Principles in Statistical Mechanics. The primary utility of an ensemble is as a mathematical construction that differentiates between samples and how they would form extended objects.

In this direction, we provide the basics of constructing ensembles in a pedagogically accessible way from sample spaces that clears up a possible misconception. This usage of ensemble prevents the overuse of the term sample space for different things. We introduce some basic formal definitions.

    Figure: Gibbs's book
 introduced the concept of
ensemble (Wikipedia).

What did Gibbs have in mind by constructing statistical ensembles?

A statistical ensemble is a mathematical tool that connects statistical mechanics to thermodynamics. The concept lies in defining microscopic states for molecular dynamics; in statistics and probability, these correspond to a set of events. Though these events differ at a microscopic level, they are sampled from a single thermodynamic ensemble, a representative of varying material properties or, in general, a set of independent random variables. In dynamics, micro-states sample an ensemble. This simple idea helped Gibbs to build a mathematical formalism of statistical mechanics, a companion to Boltzmann's theories.

Differentiating sample space and ensemble in general

The primary confusion in probability theory about what constitutes a sample space is that there is no distinction between primitive events and events composed of primitive events. We call both sets sample space. This terminology is easily overlooked, as in solving practical problems we concentrate on the event set rather than the primitive event set.

Definition: A primitive event $\mathscr{e}$ is a logically distinct unit of experimental realisation that is not composed of any other events.

Definition: A sample space $\mathscr{S}$ is a set formed by all $N$ distinct primitive events $\mathscr{e}_{i}$.  

By this definition, regardless of how many fair coins are used, or whether a single coin is tossed in a sequence, the sample space is always $\{H, T\}$, because these are the most primitive distinct events the system can have, i.e., single-coin outcomes. However, the statistical ensemble can be different. For example, for two fair coins, or coin tosses in a sequence of length two, the corresponding ensemble of system size two reads $\{HH, TT, HT, TH\}$. The definition of an ensemble then follows.

Definition: An ensemble $\mathscr{E}$ is a set of ordered sets of primitive events $\mathscr{e}_{i}$. These event sets can be sampled with replacement, but order matters, i.e., $\{e_{i}, e_{j}\} \ne \{e_{j}, e_{i}\}$ for $i \ne j$.

Our two-coin example's ensemble should be formally written as $\mathscr{E}=\{\{H,H\}, \{T,T\}, \{H,T\}, \{T,H\}\}$; as order matters, the members $HT$ and $TH$ are distinct. Obviously, for a single toss the ensemble and the sample space are the same.
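
A minimal sketch of this construction in Python; the variable names are hypothetical. The sample space holds only the primitive events, while the ensemble is generated as ordered tuples over the system size.

  from itertools import product

  sample_space = ("H", "T")        # primitive events of a single fair coin
  system_size = 2                  # two coins, or one coin tossed twice

  # Ensemble: ordered tuples of primitive events; HT and TH are distinct.
  ensemble = list(product(sample_space, repeat=system_size))
  print(ensemble)   # [('H', 'H'), ('H', 'T'), ('T', 'H'), ('T', 'T')]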

Ergodicity makes the need for differentiation much clearer: Time and ensemble averaging

The above distinction makes building time and ensemble averages much easier. The term ensemble averaging is obvious: we know what the ensemble set is, and we average a given observable over this set. Time averaging can then be achieved by curating a much larger set by resampling with replacement from the ensemble. Note that the resulting time-average value would not be unique, as one can generate many different sample sets from the ensemble. However, bear in mind that the definition of how to measure convergence to the ergodic regime is not unique either.
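
The following sketch continues the two-coin example under the same assumptions; the observable (number of heads) is hypothetical, chosen only for illustration.

  import random
  from itertools import product

  ensemble = list(product(("H", "T"), repeat=2))   # as constructed above

  # Hypothetical observable: number of heads in a configuration.
  def obs(config):
      return sum(1 for c in config if c == "H")

  ensemble_avg = sum(obs(c) for c in ensemble) / len(ensemble)

  # Time average: curate a larger set by resampling with replacement.
  random.seed(11)
  trajectory = random.choices(ensemble, k=10_000)
  time_avg = sum(obs(c) for c in trajectory) / len(trajectory)
  print(ensemble_avg, time_avg)   # 1.0 and a value close to 1.0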

Conclusion

Even though the distinction we made sounds very obscure,  this alignment between statistical mechanics and probability theory may clarify the conception of ergodic regimes for general practitioners.

Further reading

Please Cite:

 @misc{suezen22dess, 
     title = {Differentiating ensembles and sample spaces: Alignment between statistical mechanics and probability theory}, 
     howpublished = {\url{https://science-memo.blogspot.com/2022/11/ensembles-probability-theory.html}}, 
     author = {Mehmet Süzen},
     year = {2022}
}  

Postscript

  • If there are multiple events coming from a set of primitive events, the compositional outcomes are considered an ensemble, not a sample space. The sample space is the set we sample from, either once or multiple times, to build an ensemble. The ensemble notion within a pure ML context was also noted by the late David J. C. MacKay in his book Information Theory, Inference, and Learning Algorithms, Cambridge University Press (2003).


Tuesday, 25 October 2022

Overfitting is about complexity ranking of inductive biases: Algorithmic recipe

Preamble

    Figure: Moon patterns
human brain
 invents. (Wikipedia)
Detecting overfitting is inherently a comparison problem over the complexity of multiple objects, i.e., models, or algorithms capable of making predictions. A model is overfitted (underfitted) only relative to another model. Model selection involves comparing multiple models with different complexities. A summary of this approach with basic mathematical definitions is given here.

Misconceptions: Poor generalisation is not synonymous with overfitting. 

None of these techniques would prevent us from overfitting: cross-validation, having more data, early stopping, and comparing test-train learning curves are all about generalisation. Their purpose is not to detect overfitting.

We need at least two different models, i.e., two different inductive biases, to judge which model is overfitted. One distinct approach in deep learning, called dropout, prevents overfitting while alternating between multiple models, i.e., multiple inductive biases. For the judgment, a dropout implementation has to compare those alternating models' test performances during training to judge overfitting.

What is an inductive bias? 

There are multiple conceptions of inductive bias. Here, we concentrate on a parametrised model $\mathscr{M}(\theta)$ on a dataset $\mathscr{D}$: the selection of a model type or modelling approach, usually manifest as a functional form $\mathscr{M}=f(x)$ or as a function approximation, for example a neural network, are all manifestations of inductive biases. Different parametrisations of a model learned on subsets of the dataset are still the same inductive bias.

Complexity ranking of inductive biases: An Algorithmic recipe 

We sketch an algorithmic recipe for complexity ranking of inductive biases via informal steps, with a concrete sketch after the list:
  1. Define a complexity measure $\mathscr{C}$($\mathscr{M}$) over an inductive bias.
  2. Define a generalisation measure  $\mathscr{G}$($\mathscr{M}$, $\mathscr{D}$) over an inductive bias and dataset.
  3. Select a set of inductive biases, at least-two, $\mathscr{M}_{1}$ and $\mathscr{M}_{2}$.
  4. Produce complexity and generalisation measures on ($\mathscr{M}$, $\mathscr{D}$): Here for two inductive biases: $\mathscr{C}_{1}$, $\mathscr{C}_{2}$,   $\mathscr{G}_{1}$, $\mathscr{G}_{2}$.
  5. Ranking of  $\mathscr{M}_{1}$ and $\mathscr{M}_{2}$:  $argmax \{ \mathscr{G}_{1}, \mathscr{G}_{2}\}$ and $argmin \{ \mathscr{C}_{1}, \mathscr{C}_{2}\}$
The core concept: when generalisations are close enough, we pick the inductive bias that is less complex.
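
A minimal concrete sketch of the recipe with NumPy follows; the data-generating process, seed, and train/test split are hypothetical. Complexity is taken as the parameter count and generalisation as negative test error, which are only one possible choice of $\mathscr{C}$ and $\mathscr{G}$; on most seeds the degree-9 polynomial is flagged as overfitted relative to the line.

  import numpy as np

  rng = np.random.default_rng(0)
  x = np.linspace(0, 1, 40)
  y = 2 * x + 0.1 * rng.normal(size=x.size)       # truth: linear plus noise
  train, test = np.arange(0, 40, 2), np.arange(1, 40, 2)

  def measures(degree):
      """C: parameter count; G: negative test MSE (higher is better)."""
      coeffs = np.polyfit(x[train], y[train], degree)
      mse = np.mean((np.polyval(coeffs, x[test]) - y[test]) ** 2)
      return degree + 1, -mse

  C1, G1 = measures(9)    # M1: degree-9 polynomial (more complex)
  C2, G2 = measures(1)    # M2: straight line (less complex)

  # Step 6 of the recipe: M1 overfitted vs M2 if G1 <= G2 and C1 > C2.
  print("M1 overfitted relative to M2:", G1 <= G2 and C1 > C2)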

Conclusion & Outlook

In practice, probably due to hectic delivery constraints or mere laziness, we still rely on the simple holdout method to build models, with only a single test-train split, not even learning curves, especially in deep learning models, without practicing Occam's razor. A major insight in this direction is that the holdout approach can only help us detect generalisation, not overfitting. We clarify this via the concept of inductive bias, noting that different parametrisations of the same model do not change the inductive bias introduced by the modelling choice.

In fact, due to the resource constraints of the model life-cycle, i.e., energy consumption and the cognitive load of introducing a complex model, practicing a proper Occam's razor, complexity ranking of inductive biases, is more important than ever for a sustainable environment and human capital.

Further reading

Some of the posts, in reverse chronological order, in which this blog has tried to convey what overfitting entails and its general implications.


Tuesday, 4 October 2022

Heavy-matter-wave and ultra-sensitive interferometry: An opportunity for quantum gravity to become evidence-based research

    Solar Eclipse of 1919
(wikipedia)

Preamble
 


   
Cool ideas in theoretical physics are often opaque for the general reader as to whether they are backed by any experimental evidence in the real world. The success of LIGO (Laser Interferometer Gravitational-wave Observatory) has definitely proven the value of interferometry for advancing cool ideas of theoretical physics supported by real-world, measurable evidence. Another type of interferometry that could be used to test multiple different ideas from theoretical physics is called matter-wave interferometry or atom interferometry. It has been around for decades, but new developments and increased sensitivity, with measurements on heavy atomic system-waves, will pave the way for the technical capability to test multiple ideas of theoretical physics.

Basic mathematical principle of interferometry

Interferometry is usually explained with device and experimental-setup details that can be confusing. However, one can explain the very principle without introducing any experimental setup. The basic idea of interferometry is that a simple wave, such as $\omega(t)=\sin\Theta(t)$, is first split into two waves and reflected over the same distance, one of them shifted by a constant phase, in vacuum without any interactions. A linear combination of the returned waves $\omega_{1}(t)=\sin \Theta(t)$ and $\omega_{2}(t)=\sin(\Theta(t) + \pi)$ will yield zero, i.e., an interference pattern generated by $\omega_{1}(t)+\omega_{2}(t)=0$. This very basic principle can be used to detect interactions, and the characteristics of those interactions, that a wave encounters over the time it travels to reflect and come back. Of course, the basic wave used in many interferometry experiments is laser light, and the interaction we measure could be a gravitational wave that interacts with the laser light, i.e., LIGO's set-up.
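
The principle can be verified numerically in a few lines; the linear phase $\Theta(t)=t$ below is an assumption made only to keep the sketch minimal.

  import numpy as np

  t = np.linspace(0, 4 * np.pi, 1000)
  theta = t                          # an assumed simple linear phase
  w1 = np.sin(theta)                 # reference wave
  w2 = np.sin(theta + np.pi)         # same wave, shifted by constant phase pi
  print(np.max(np.abs(w1 + w2)))     # ~1e-16: complete destructive interference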

Detection of matter-waves : What is heavy and ultra-sensitivity?

Every atomic system exhibits quantum wave properties, i.e., matter waves. This implies that a given molecular system has wave signatures-characteristics which could be extracted in an experimental setting. Instead of laser light, one could use an atomic system that is reflected, similar to the basic principle. However, the primary difference is that increasing mass requires orders of magnitude more sensitive wave detectors for atomic interferometers. Currently, heavy usually means above ~$10^{9}$ Da (compare Helium-4, which is about ~4 Da). These new heavy atomic interferometers might be able to detect gravitational interactions at the quantum-wave level due to the ultra-sensitive precision achieved. This sounds trivial, but an experimental connection to theories of quantum gravity, one of the unsolved puzzles in theoretical physics, appears to be a potential breakthrough. Prominent examples in this direction are entropic gravity and wave-function collapse theories.

Conclusion

Recent developments in heavy matter-wave interferometry could be leveraged for testing quantum-gravity arguments and theoretical suggestions. We have tried to bring this idea to general attention without resorting to describing experimental details.

Further Reading & Notes
  • Dalton, mass-unit used in matter-wave interferometry. 
  • Atom Interferometry by Prof. Pritchard YouTube.
  • Newton-Schrödinger equation.
  • A roadmap for universal high-mass matter-wave interferometry, Kilka et al., AVS Quantum Sci. 4, 020502 (2022). doi
    • Current capabilities as of 2022: atom interferometers can reach up to ~300 kDa.
  • Testing Entropic gravity, arXiv
  • NASA early stage ideas workshops : web-archive

Tuesday, 20 September 2022

Building robust AI systems: Is an artificial intelligent agent just a probabilistic boolean function?


Preamble
    George Boole (Wikipedia)

Agent, AI agent, or intelligent agent is often used to describe algorithms or AI systems released by research teams recently. However, the definition of an intelligent agent (IA) is a bit opaque. Naïvely, it is nothing more than a decision maker that shows some intelligent behaviour. However, making a decision intelligently is hard to quantify computationally, and probably an IA, for us, is something that can be represented as a Turing machine. Here, we argue that an intelligent agent in current AI systems should be seen as a function without side effects producing a Boolean output, and shouldn't be extrapolated or compared to human-level intelligence. Causal inference capabilities should be seen as scientific guidance for this decomposition into functions without side effects, i.e., human-in-the-loop Probabilistic Boolean Functions (PBFs).

Computational learning theories are based on binary learners

Two of the major  theories of statistical learning PAC and VC dimensions build upon on "binary learning".  

PAC stands for Probably Approximately Correct. It sets a basic framework and the mathematical building blocks for defining a machine learning problem from a complexity-theory standpoint. Probably correct implies finding a weak learning function given a binary instance set $X=\{1,0\}^{n}$. The binary set, or its subsets, are mathematically called concepts, and under certain mathematical conditions a system is said to be PAC learnable. There are equivalences to VC and other computational learning frameworks.

Robust AI systems: Deep reinforcement learning and  PAC

Even though the theory of learning for deep (reinforcement) learning is not established and remains an active area of research, there is an intimate connection with the composition of concepts, i.e., binary instance subsets, as almost all operations within deep RL can be viewed as probabilistic Boolean functions (PBFs).
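
A minimal sketch of a PBF under one common reading of the concept: a set of Boolean predictor functions over binary instances, one of which is selected probabilistically at evaluation time. The particular predictors and selection probabilities are hypothetical.

  import random

  # Hypothetical component predictors, each with a selection probability.
  predictors = [
      (0.7, lambda x: x[0] and x[1]),   # chosen with probability 0.7
      (0.3, lambda x: x[0] or x[2]),    # chosen with probability 0.3
  ]

  def pbf(x):
      """Evaluate the PBF on a binary instance x in {0,1}^n."""
      r, acc = random.random(), 0.0
      for p, f in predictors:
          acc += p
          if r < acc:
              return bool(f(x))
      return bool(predictors[-1][1](x))

  random.seed(7)
  print(pbf((1, 0, 1)))  # Boolean output, stochastic across calls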

Conclusion 

Current research and practice in robust AI systems could focus on producing learnable probabilistic Boolean functions (PBFs) as intelligent agents, rather than on human-level intelligent agents. This modest purpose might bear more practical fruit than the long-term aim of replacing human intelligence. Moreover, the theory of computation for deep learning and causality could benefit from this approach.

Further reading


Tuesday, 5 July 2022

Bayesian rabbit holes: Decoding conditional probability with non-commutative algebra

Preamble

    The White Rabbit
(Wikipedia)

A novice analyst, or even an experienced (data) scientist, might think that the bar notation $|$ for conditional probability carries some different operational mathematics, primarily when written with explicit distribution functions, $p(x|y)$. A similar thought applies to joint probabilities such as $p(x, y)$. One could also see a mixture of these, such as $p(x, y | z)$. In this short exposition, we clarify that none of these identifications within the arguments of a probability has any different resulting operational meaning.

Arguments in probabilities: Boolean statement and filtering 

Arguments in any probability are mathematical statements of discrete mathematics that correspond to events in an experimental setting. These are statements declaring some fact with a Boolean outcome, and they are queries to a dataset; for example, whether the temperature is above $30$ degrees, $T > 30$. Temperature $T$ is a random variable. Unfortunately, the term random variable is often used differently in many textbooks; it is defined as a mapping rather than as a single variable. The bar $|$ in a conditional probability $p(x|y)$ means statement $x$ given that statement $y$ has already occurred, i.e., an if. This interpretation implies that $y$ occurred before $x$, but it doesn't imply that they are causally linked. The condition plays a filtering role, a where clause in query languages: $p(x|y)$ boils down to $p_{y}(x)$, where the statement $y$ is first applied to the dataset before computing the probability of the remaining statement $x$.
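
A minimal sketch of $p(x|y)$ as filter-then-frequency with pandas; the toy weather table below is hypothetical.

  import pandas as pd

  # Hypothetical table: T is a temperature reading, R indicates rain (0/1).
  df = pd.DataFrame({"T": [25, 31, 33, 28, 35, 30, 36],
                     "R": [0,  1,  1,  0,  1,  0,  0]})

  # p(R=1 | T>30): the condition acts as a 'where clause' applied first,
  conditioned = df[df["T"] > 30]           # p_y: filter on y = (T > 30)
  p = (conditioned["R"] == 1).mean()       # then a frequency over x = (R = 1)
  print(p)                                 # 3 of 4 filtered rows: 0.75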

In the case of joint probabilities $p(x, y)$, events co-occur, i.e., AND statement. In summary, anything in the argument of $p$ is written as a mathematical statement. In the case of assigning a distribution or a functional form to $p$, there is no particular role for conditionals or joints; the modelling approach sets an appropriate structure.

Conditioning does not imply causal direction: do-calculus does

A filtering interpretation of the conditional $p(x|y)$ does not imply a causal direction, but the $do$ operator does: $p(x|do(y))$.

Non-commutative algebra: When frequentists are equivalent to Bayesians

Most simple filtering operations would yield identical results if reversed, $p(x|y) = p(y|x)$, the prior being equal to the posterior. This remark implies that we can't apply Bayesian learning with commutative statements. We need non-commutative statements; as a result, one can do Bayesian learning with newly arriving data, i.e., the arrival of new subjective evidence. The reason seems to be the frequentist nature of filtering.

Outlook 

Even though we provided some revelations in decoding the operational meaning of conditional probabilities, we suggested that any conditional, joint, or combination of these within the argument of a probability has no operational purpose other than as pre-processing steps. However, the philosophical and practical implications of probabilistic reasoning are always counterintuitive. Probabilistic reasoning is a computationally complex problem. From a causal inference perspective, we are better equipped to tackle these issues with do-Bayesian analysis.

Further reading

Please Cite as:

 @misc{suezen22brh, 
     title = {Bayesian rabbit holes: Decoding conditional probability with non-commutative algebra}, 
     howpublished = {\url{https://science-memo.blogspot.com/2022/07/bayesian-conditional-noncommutative.html}}, 
     author = {Mehmet Süzen},
     year = {2022}
}  

Monday, 20 June 2022

Empirical risk minimization is not learning:
A mathematical definition of learning and re-understanding of overfitting and Occam's razor in machine learning

    Simionescu Function (Wikipedia)

Preamble

The holy grail of machine learning appears to be empirical risk minimisation. However, contrary to the general dogma, the primary objective of machine learning is not risk minimisation per se but mimicking human or animal learning. Empirical risk minimisation is just a snapshot in this direction and part of a learning measure, not the primary objective.

Unfortunately, all current major machine learning libraries implement empirical risk minimisation as the primary objective, the so-called training, usually manifest as .fit. Here we provide a mathematical definition of learning in the language of empirical risk minimisation and its implications for two very important concepts, overfitting and Occam's razor.

Our exposition is still informal, but it should be readable for experienced practitioners.

Definition: Empirical Risk Minimization

Given a set of $k$ observations $\mathscr{O} = \{o_{1}, ..., o_{k}\}$, where the $o_{i} \in \mathbb{R}^{n}$ are $n$-dimensional vectors, the corresponding labels or binary classes form the set $\mathscr{S} = \{s_{1}, .., s_{k}\}$ with $s_{i} \in \{0,1\}$. A function $g$ maps observations to classes, $g: \mathscr{O} \to \mathscr{S}$. An error function (or loss) $E$ measures the error made by the estimated map function $\hat{g}$ compared to the true map function $g$, $E=E(\hat{g}, g)$. The entire idea of supervised machine learning boils down to minimising a functional called ER (Empirical Risk), denoted here by $G$; it is a functional, meaning a function of a function, over the domain $\mathscr{D} = Tr(\mathscr{O} \times \mathscr{S})$ in discrete form, $$G[E] = \frac{1}{k} \sum_{\mathscr{D}} E(\hat{g}, g).$$ This is the so-called training of a machine learning model, or an estimation of $\hat{g}$. However, testing this estimate on new data is not the main purpose of learning.

Definition: Learning measure 

A learning measure $M$ on $\hat{g}$ is defined over a set of $l$ observation sets of increasing size, $\Theta = \{\mathscr{O}_{1}, ..., \mathscr{O}_{l}\}$, whereby the size of each set is monotonically larger, meaning $|\mathscr{O}_{1}| < |\mathscr{O}_{2}| < ... < |\mathscr{O}_{l}|$.

Definition: Empirical Risk Minimization with a learning measure (ERL)

Now we are in a position to reformulate ER with a learning measure; we call this ERL. This comes with a testing procedure.

If the empirical risks $G[E_{j}]$ decrease monotonically, $G[E_{1}] > G[E_{2}] > ... > G[E_{l}]$, then we say the functional form of $\hat{g}$ is learning over the set $\Theta$.
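
A minimal sketch of the ERL test with NumPy; the data-generating rule, the least-squares stand-in for training, and the observation-set sizes are all hypothetical, and on most random seeds the risks decrease monotonically (in practice one would average over repetitions).

  import numpy as np

  rng = np.random.default_rng(1)

  # Hypothetical data: labels s from a true linear rule in R^10.
  X = rng.normal(size=(2000, 10))
  w_true = rng.normal(size=10)
  s = (X @ w_true > 0).astype(int)

  def empirical_risk(w, X, s):
      """G[E]: mean 0-1 loss of the linear-threshold estimate g_hat."""
      return np.mean(((X @ w > 0).astype(int)) != s)

  # Observation sets of monotonically increasing size (the set Theta).
  risks = []
  for k in (20, 80, 320):
      # Crude stand-in for training: least squares on +/-1 targets.
      w_hat = np.linalg.lstsq(X[:k], 2 * s[:k] - 1, rcond=None)[0]
      risks.append(empirical_risk(w_hat, X, s))

  # ERL test: g_hat "is learning" if the risks decrease monotonically.
  print(risks, all(a > b for a, b in zip(risks, risks[1:])))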

Functional form of $\hat{g}$ : Inductive bias

The functional form implies a model selection; the technical term for this, together with other assumptions, is inductive bias, meaning the selection of the complexity of the model, for example a linear regression versus a nonlinear regression.

Re-understanding of overfitting and Occam's razor from ERL perspective 

If we have two different ERLs on $\hat{g}^{1}$ and $\hat{g}^{2}$, then overfitting is a comparison problem between the two monotonically decreasing empirical risk sequences. If both models, here inductive biases or functional forms, learn over the learning measure, we select the one with the stronger monotonicity and lower complexity, and call the other the overfitted model. Complexity here boils down to the functional complexity of $\hat{g}^{1}$ and $\hat{g}^{2}$, and overfitting can only be tested with two models over the monotonicity of their ERLs.

Conclusions

In the age of deep learning systems, classical learning theory needs an update on how we define learning beyond a single-shot fitting exercise. A first step in this direction would be to improve upon the basic definition of Empirical Risk (ER) minimisation so that it reflects real-life learning systems, similar to the forgetting mechanism proposed by Ebbinghaus. This is consistent with Tom Mitchell's operational definition of machine learning. A next level would be to add causality to the definition.

Please cite as follows:

 @misc{suezen22erm, 
     title = { Empirical risk minimization is not learning : A mathematical definition of learning and re-understanding of overfitting and Occam's razor in machine learning}, 
     howpublished = {\url{http://science-memo.blogspot.com/2022/06/empirical-risk-minimisation-learning-curve.html}}, 
     author = {Mehmet Süzen},
     year = {2022}

}  

Postscript Notes

The following notes were added after the initial release.

Postscript 1: Understanding overfitting as comparison of inductive biases

ERM can be confusing even for experienced researchers. It is indeed about a risk measure. We measure the risk of a model, i.e., of a machine learning procedure: how much error would it make on new data from the given distribution, as in the risk of an investment. This is quite a similar notion to the financial risk of loss, though not explicitly stated as such.


Moreover, a primary objective of machine learning is not ERM but measuring learning curves and pairwise comparison of inductive biases, avoiding overfitting. An inductive bias, a concept we restrict here to mean model type, is a model selection step: different parametrisations of the same model are still the same inductive bias. That's why standard training-error learning curves can't be used to detect overfitting alone.

Postscript 2: Learning is not to optimise: Thermodynamic limit, true risk and accessible learning space

True risk minimisation in machine learning is not possible; instead we rely on ERM, i.e., Empirical Risk Minimisation. However, the purpose of a machine learning algorithm is not to minimise risk, as we only have partial knowledge about reality through data. Learning implies finding a region in the accessible learning space whereby there is a monotonic increase in the objective; ERM is only a single point in this space. The concept is rooted in the German scientist Hermann Ebbinghaus's work on memory.


There is an intimate connection to the thermodynamic limit and true risk in this direction, as open research. However, it doesn't imply an infinite limit of data, but rather concerns the observable's behaviour. That's why fully empiricist approaches usually require the complement of physical laws, such as Physics Informed Neural Networks (PINNs) or Structural Causal Models (SCMs).

Postscript 3: Missing abstraction in modern machine learning libraries 

Interestingly, current modern machine learning libraries stop abstracting at fitting: .fit and .predict. This falls short of learning as in machine learning. Learning manifests itself in learning curves. A .learn functionality could be leveraged beyond fitting, telling us whether we are learning via monotonically increasing performance. The origin of this lack of tooling for .learn appears to be how Empirical Risk Minimisation (ERM) is formulated on a single task.
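
A hypothetical sketch of what such a .learn abstraction could look like, assuming any .fit/.predict-style model; the class name, sizes, and wiring are invented here for illustration and are not part of any existing library.

  import numpy as np

  class Learner:
      """A hypothetical .learn abstraction on top of a .fit/.predict
      model: fit on observation sets of increasing size and report
      whether the empirical risks decrease monotonically (the ERL test)."""

      def __init__(self, model):
          self.model = model   # any object with .fit(X, y) and .predict(X)

      def learn(self, X, y, sizes):
          risks = []
          for k in sizes:
              self.model.fit(X[:k], y[:k])
              risks.append(np.mean(self.model.predict(X) != y))
          is_learning = all(a > b for a, b in zip(risks, risks[1:]))
          return risks, is_learning

  # Usage sketch, assuming any scikit-learn style classifier:
  # learner = Learner(DecisionTreeClassifier(max_depth=3))
  # risks, is_learning = learner.learn(X, y, sizes=(100, 400, 1600))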

Wednesday, 11 May 2022

A misconception in ergodicity: Identify ergodic regime not ergodic process

Preamble 

    Figure 1: Two observables' approach to
 ergodicity for Bernoulli Trials. 
Ergodicity appears in many fields: in physics, chemistry and the natural sciences, but in economics and machine learning as well. Recall that the physics and mathematical definitions of ergodicity diverge significantly, due to Birkhoff's statistical definition versus Boltzmann's physical approach. Here we will follow Birkhoff's definition of ergodicity, which is the statistical one. The basic notion of ergodicity is confusing even in experienced academic circles. The primary misconception is that ergodicity is attributed to a process, a given process being ergodic. We address this by pointing out that ergodicity appears as a regime, a window so to speak, in a given process's time-evolution, and it can't be attributed to the entire generating process.

No such thing as ergodic process but ergodic regime given observable

Identifying a process as ergodic is not an entirely true identification. Ergodicity is a regime over a given time window for a given observable derived from the process. This is the basis of ensemble theory in statistical physics. Most processes initially generate a non-ergodic regime for a given observable. In order to identify an ergodic regime, we need to define, for a discrete setting:

  1. the ensemble (sample space): in discrete dynamics we also have an alphabet that the ensemble is composed of.
  2. an observable defined over the sample space.
  3. a process (usually dynamics on the sample space evolving over time).
  4. a measure and threshold to discriminate ergodic from non-ergodic regimes. 
Interestingly, different observables on the same ensemble and process may generate different ergodic regimes.

What are the processes and regimes mathematically?

A process is essentially a dynamical system mathematically. This includes stochastic models as well as deterministic systems sensitive to initial conditions; prominently, both are combined in statistical physics. A regime mathematically implies a range of parameters, or a time-window, in which a system behaves very differently.

Identification of ergodic regime

Figure 2: Evolution of
time-averaged OR observable.
The main objective in finding out whether the dynamics produced by the process on our observable enters, or is in, an ergodic regime is to measure whether the ensemble-averaged observable is equivalent to the time-averaged observable. Here, equivalence is a difficult concept to address quantitatively. The simplest measure would be to check if $\Omega = \langle A \rangle_{ensemble} - \langle A \rangle_{time}$ is close to zero, i.e., vanishing, $\Omega$ being the ergodicity measure and $A$ the observable under the two different averaging procedures. This is the definition we will use here. However, beware that in the physics literature there are more advanced measures to detect ergodicity, such as considering diffusion-like behaviour, meaning the transition from non-ergodic to ergodic regime is not abrupt but has a diffusing approach to ergodicity.

Figure 3: Evolution of
time-averaged mean.

In other academic fields, the approach to the ergodic regime has different, not strictly but closely related, names: in chemical physics or molecular dynamics, equilibration time, relaxation time, equilibrium, or steady-state for a given observable; in statistical Monte Carlo simulations, it is usually called the burn-in period. Not always, but in the ergodic regime the observable is stationary and time-independent. In physics, this is much easier to distinguish because time-dependence, equilibrium, and stationarity are tied to energy transfer to the system.

Ergodic regime not ergodic process : An example of Bernoulli Trials

Apart from real physical processes, such as the Ising model, a basic process we can use to understand how an ergodic regime can be detected is Bernoulli trials.

Here, for a Bernoulli trial/process, we will use a random number generator for a binary outcome, i.e., the Mersenne Twister RNG, to generate the time evolution of observables on two sites. Let's say we have two sites $x, y \in \{1, 0\}$. The ensemble of this two-site system $xy$ is simply the sample space of all possible outcomes $S=\{10, 11, 01, 00\}$. The time evolution of such a two-site system is formulated here as choosing a value from $\{0,1\}$ for a given site at a given time; see the Appendix Python notebook.

Now, the most important part of checking the ergodic regime is that we need to define observables over the two-site trials. We denote the two observables as $O_{1}$, the average over the two sites, and $O_{2}$, an OR operation between the sites. Since our sample space is small, we can compute the ensemble-averaged observables analytically:

  • $O_{1} = (x+y)/2$: over $10, 11, 01, 00$; $(1/2 + 2/2 + 1/2 + 0)/4 = 0.5$
  • $O_{2} = x \,\mathrm{OR}\, y$: over $10, 11, 01, 00$; $(1 + 1 + 1 + 0)/4 = 0.75$   

We compute the time-averaged observables over time via simulations, but their formulations are known as follows:

  • Time average of $O_{1}$ at time $t$ (current step) is $\frac{1}{t} \sum_{i=0}^{t} (x_{i}+y_{i})/2$.
  • Time average of $O_{2}$ at time $t$ (current step) is $\frac{1}{t} \sum_{i=0}^{t} (x_{i} \,\mathrm{OR}\, y_{i})$.

One of the possible trajectories is shown in Figures 2 and 3. The approach-to-ergodicity measure is shown in Figure 1. Even though we should run multiple trajectories to obtain error estimates, we can clearly see that the ergodic regime starts after 10K steps, at least. Moreover, different observables have different decay rates towards the ergodic regime. From this preliminary simulation, the OR observable appears to converge more slowly, though this is a single trajectory.
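
A minimal self-contained sketch of the simulation follows; it uses NumPy's default generator rather than the Mersenne Twister of the post's notebook, and the seed and step count are arbitrary choices, so the exact numbers will differ from the figures.

  import numpy as np

  rng = np.random.default_rng(2022)   # NumPy's default PCG64; the post's
                                      # notebook uses the Mersenne Twister

  steps = 50_000
  x = rng.integers(0, 2, size=steps)  # site x trajectory
  y = rng.integers(0, 2, size=steps)  # site y trajectory

  t = np.arange(1, steps + 1)
  time_avg_mean = np.cumsum((x + y) / 2.0) / t   # O1: site average
  time_avg_or = np.cumsum(x | y) / t             # O2: OR observable

  # Ergodicity measure Omega = <A>_ensemble - <A>_time for each observable
  omega_mean = 0.5 - time_avg_mean[-1]
  omega_or = 0.75 - time_avg_or[-1]
  print(omega_mean, omega_or)   # both should be close to zero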

Conclusion

We have shown that the manifestation of the ergodic regime depends on the time-evolution of the observable, given a measure of ergodicity, i.e., a condition for how ergodicity is detected. This exposition should clarify that a generating process does not get the attribute "ergodic process"; rather, we talk about an "ergodic regime" depending on the observable and the process over temporal evolution. Interestingly, from a physics point of view, it is perfectly possible that an observable attains an ergodic regime and then falls back to a non-ergodic regime.

Further reading

Appendix: Code

Bernoulli Trial example we discussed is available as a Python notebook on github here

Please cite as follows:

 @misc{suezen22ergoreg, 
     title = {A misconception in ergodicity: Identify ergodic regime not ergodic process}, 
     howpublished = {\url{http://science-memo.blogspot.com/2022/05/ergodic-regime-not-process.html}}, 
     author = {Mehmet Süzen},
     year = {2022}
}  
(c) Copyright 2008-2024 Mehmet Suzen (suzen at acm dot org)

Creative Commons License
This work is licensed under a Creative Commons Attribution 4.0 International License.