Monday, 20 June 2022

Empirical risk minimization is not learning: A mathematical definition of learning and a re-understanding of overfitting and Occam's razor in machine learning

 Figure: Simionescu Function (Wikipedia)

Preamble

The holy grail of machine learning appears to be empirical risk minimisation. However, contrary to the general dogma, the primary objective of machine learning is not risk minimisation per se but mimicking human or animal learning. Empirical risk minimisation is just a snapshot in this direction and is part of a learning measure, not the primary objective.

Unfortunately, all current major machine learning libraries implement empirical risk minimisation as the primary objective, so-called training, usually manifest as a .fit method. Here we provide a mathematical definition of learning in the language of empirical risk minimisation and discuss its implications for two very important concepts, overfitting and Occam's razor.

Our exposition is still informal, but it should be readable for experienced practitioners.

Definition: Empirical Risk Minimization

Given a set of $k$ observations $\mathscr{O} = \{o_{1}, ..., o_{k} \}$, where each $o_{i} \in \mathbb{R}^{n}$ is an $n$-dimensional vector, the corresponding labels, or binary classes, form the set $\mathscr{S} = \{ s_{1}, ..., s_{k}\}$ with $s_{i} \in \{0,1\}$. A function $g$ maps observations to classes, $g: \mathscr{O} \to \mathscr{S}$. An error function (or loss) $E$ measures the error made by the estimated map $\hat{g}$ compared to the true map $g$, $E=E(\hat{g}, g)$. The entire idea of supervised machine learning boils down to minimising a functional called the empirical risk (ER), denoted here by $G$; it is a functional, meaning a function of a function, defined over the training domain $\mathscr{D} \subseteq \mathscr{O} \times \mathscr{S}$ in discrete form, $$G[E] = \frac{1}{k} \sum_{\mathscr{D}} E(\hat{g}, g).$$ Minimising this functional is what is usually called training a machine learning model, i.e., producing an estimate $\hat{g}$. However, testing this estimate on new data is not the main purpose of learning.
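As a concrete illustration, the functional $G[E]$ with a 0-1 loss can be sketched in a few lines of Python; the threshold classifier and toy data below are illustrative assumptions, not from the text.

```python
# Sketch of the empirical risk functional G[E] with a 0-1 loss.
# `g_hat` is an estimated map from observations to {0, 1}; names are illustrative.

def empirical_risk(g_hat, observations, labels):
    """Average 0-1 loss of g_hat over k (observation, label) pairs."""
    k = len(observations)
    return sum(g_hat(o) != s for o, s in zip(observations, labels)) / k

# Toy example: a threshold classifier on 1-dimensional observations.
g_hat = lambda o: 1 if o > 0.5 else 0
obs = [0.1, 0.4, 0.6, 0.9]
labels = [0, 0, 1, 1]
risk = empirical_risk(g_hat, obs, labels)  # 0.0: g_hat separates this toy data
```

Training, in this vocabulary, is searching over candidate $\hat{g}$ for the one minimising this quantity.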

Definition: Learning measure

A learning measure $M$ on $\hat{g}$ is defined over a collection of $l$ observation sets of increasing size, $\Theta = \{ \mathscr{O}_{1}, ..., \mathscr{O}_{l}\}$, whereby the set sizes grow monotonically, meaning that $| \mathscr{O}_{1}| < | \mathscr{O}_{2}| < ... < | \mathscr{O}_{l}|$.

Definition: Empirical Risk Minimization with a learning measure (ERL)

Now we are in a position to reformulate ER with a learning measure; we call this ERL. It comes with a testing procedure.

If the empirical risks $E_{j}$ decrease monotonically, $E_{1} > E_{2} > ... > E_{l}$, then we say the functional form of $\hat{g}$ is learning over the set $\Theta$.
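The ERL test above reduces to a monotonicity check on a sequence of empirical risks; the risk values in the usage lines are made-up illustrations.

```python
def is_learning(risks):
    """ERL test: empirical risks E_1 > E_2 > ... > E_l over observation
    sets of increasing size mean the functional form of g_hat is learning."""
    return all(e_prev > e_next for e_prev, e_next in zip(risks, risks[1:]))

is_learning([0.40, 0.31, 0.25, 0.22])  # True: risks decrease monotonically
is_learning([0.40, 0.31, 0.35])        # False: the risk went back up
```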

Functional form of $\hat{g}$ : Inductive bias

The functional form implies a model selection; the technical term for this, together with other assumptions, is inductive bias, meaning the selection of the complexity of the model, for example a linear regression versus a nonlinear regression.

Re-understanding of overfitting and Occam's razor from ERL perspective

Suppose we have two different ERLs, on $\hat{g}^{1}$ and $\hat{g}^{2}$. Overfitting then becomes a comparison problem between monotonically decreasing empirical risks. Given two models, here inductive biases or functional forms, over the same learning measure, we select the one with "higher monotonicity" that is also less complex, and call the other one the overfitted model. Complexity here boils down to the functional complexity of $\hat{g}^{1}$ and $\hat{g}^{2}$, and overfitting can only be tested by comparing two models over the monotonicity of their ERLs.

Conclusions

In the age of deep learning systems, classical learning theory needs an update on how we define learning beyond a single-shot fitting exercise. A first step in this direction would be to improve upon the basic definition of Empirical Risk (ER) minimisation so that it reflects real-life learning systems, similar to the forgetting mechanism proposed by Ebbinghaus. This is consistent with Tom Mitchell's operational definition of machine learning. A next step would be to add causality to the definition.

Wednesday, 11 May 2022

A misconception in ergodicity: Identify ergodic regime not ergodic process

Preamble

 Figure 1: Two observables' approach to ergodicity for Bernoulli trials.
Ergodicity appears in many fields: in physics, chemistry and the natural sciences, but in economics and machine learning as well. Recall that the physical and mathematical definitions of ergodicity diverge significantly, Birkhoff's statistical definition against Boltzmann's physical approach. Here we will follow Birkhoff's definition of ergodicity, which is a statistical one. The basic notion of ergodicity is confusing even in experienced academic circles. The primary misconception is that ergodicity is attributed to a process, i.e., a given process being ergodic. We address this by pointing out that ergodicity appears as a regime, a window so to speak, in a given process's time evolution, and it can't be attributed to the entire generating process.

No such thing as ergodic process but ergodic regime given observable

Identifying a process itself as ergodic is not entirely correct. Ergodicity is a regime over a given time window for a given observable derived from the process. This is the basis of ensemble theory in statistical physics. Most processes initially generate a non-ergodic regime for a given observable. In order to identify an ergodic regime in a discrete setting, we need to define:

1. the ensemble (sample space): in discrete dynamics we also have an alphabet that the ensemble is composed of.
2. an observable defined over the sample space.
3. a process (usually dynamics on the sample space evolving over time).
4. a measure and a threshold to discriminate ergodic from non-ergodic regimes.

The interesting thing is that different observables on the same ensemble and process may generate different ergodic regimes.

What are the processes and regime mathematically?

Mathematically, a process is essentially a dynamical system. This includes stochastic models as well as deterministic systems sensitive to initial conditions; prominently, both are combined in statistical physics. A regime mathematically implies a range of parameters or a time window in which a system behaves very differently.

Identification of ergodic regime

 Figure 2: Evolution of time-averaged OR observable.
The main objective of finding out whether the dynamics produced by the process on our observable enters, or is in, an ergodic regime is to measure whether the ensemble-averaged observable is equivalent to the time-averaged observable. Here equivalence is a difficult concept to address quantitatively. The simplest measure would be to check whether $\Omega = \langle A \rangle_{ensemble} - \langle A \rangle_{time}$ is close to zero, i.e., vanishing, $\Omega$ being the ergodicity measure and $A$ the observable under the two averaging procedures. This is the definition we will use here. However, beware that in the physics literature there are more advanced measures to detect ergodicity, for example ones based on diffusion-like behaviour, meaning that the transition from a non-ergodic to an ergodic regime is not abrupt but approaches ergodicity diffusively.
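The vanishing-$\Omega$ check can be written down directly; the threshold value below is an arbitrary assumption for illustration, not a universal constant.

```python
def ergodicity_measure(ensemble_avg, time_avg):
    """Omega = <A>_ensemble - <A>_time; near zero signals an ergodic regime."""
    return ensemble_avg - time_avg

def in_ergodic_regime(ensemble_avg, time_avg, threshold=1e-2):
    """Discriminate ergodic from non-ergodic regimes by a vanishing-Omega
    criterion; the threshold is an illustrative choice."""
    return abs(ergodicity_measure(ensemble_avg, time_avg)) < threshold
```

In practice one would track this quantity along the trajectory and mark where it first stays below the threshold.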

 Figure 3: Evolution of time-averaged mean.
In some other academic fields the approach to an ergodic regime has different, not strictly but closely related, names: in chemical physics or molecular dynamics, equilibration time, relaxation time, equilibrium, or steady state for a given observable; in statistics, for Monte Carlo simulations, it is usually called the burn-in period. Not always, but in an ergodic regime the observable is stationary and time-independent. In physics this is much easier to distinguish because time dependence, equilibrium and stationarity are tied to energy transfer to the system.

Ergodic regime not ergodic process : An example of Bernoulli Trials

Apart from real physical processes such as the Ising model, a basic process we can use to understand how an ergodic regime can be detected is Bernoulli trials.

Here, for a Bernoulli trial process, we will use a random number generator for a binary outcome, i.e., the Mersenne Twister RNG, to generate the time evolution of observables on two sites. Let's say we have two sites $x, y \in \{1, 0\}$. The ensemble of this two-site system $xy$ is simply the sample space of all possible outcomes $S=\{10, 11, 01, 00\}$. The time evolution of such a two-site system is formulated here as choosing a value from $\{0,1\}$ for each site at each time step; see the Appendix Python notebook.

Now, the most important part of checking an ergodic regime is that we need to define observables over the two-site trials. We denote the two observables as $O_{1}$, the average over the two sites, and $O_{2}$, an OR operation between the sites. Since our sample space is small, we can compute the ensemble-averaged observables analytically:

• $O_{1} = (x+y)/2$: over $10, 11, 01, 00$; $(1/2 + 2/2 + 1/2 + 0)/4 = 0.5$
• $O_{2} = x \, \mathrm{OR} \, y$: over $10, 11, 01, 00$; $(1 + 1 + 1 + 0)/4 = 0.75$

We can compute the time-averaged observables over time via simulations; their formulations are as follows:

• The time average of $O_{1}$ at time $t$ (the current step) is $\frac{1}{t} \sum_{i=1}^{t} (x_{i}+y_{i})/2$.
• The time average of $O_{2}$ at time $t$ (the current step) is $\frac{1}{t} \sum_{i=1}^{t} (x_{i} \, \mathrm{OR} \, y_{i})$.
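The appendix notebook implements this in full; a compact standalone sketch (Python's random module uses the Mersenne Twister) that checks the time averages against the analytical ensemble averages could look like the following, with the step count chosen for illustration.

```python
import random

random.seed(42)  # Mersenne Twister generator

T = 200_000  # number of time steps (illustrative choice)
sum_mean, sum_or = 0.0, 0.0
for t in range(1, T + 1):
    x, y = random.randint(0, 1), random.randint(0, 1)  # two-site update
    sum_mean += (x + y) / 2.0  # O_1: average over the two sites
    sum_or += x | y            # O_2: OR of the two sites

time_avg_o1 = sum_mean / T  # approaches the ensemble average 0.5
time_avg_o2 = sum_or / T    # approaches the ensemble average 0.75
```

Plotting the running averages against $t$ reproduces the qualitative behaviour of Figures 2 and 3.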

One of the possible trajectories is shown in Figures 2 and 3; the approach-to-ergodicity measure is shown in Figure 1. Even though we should run multiple trajectories to have error estimates, we can clearly see that the ergodic regime starts after at least 10K steps. Moreover, different observables have different decay rates towards the ergodic regime. From this preliminary simulation, the OR observable appears to converge more slowly, though this is a single trajectory.

Conclusion

We have shown that the manifestation of the ergodic regime depends on the time evolution of the observable, given a measure of ergodicity, i.e., a condition for how ergodicity is detected. This exposition should clarify that a generating process does not get the attribute "ergodic process"; rather, we talk about an "ergodic regime" depending on the observable and the process over temporal evolution. Interestingly, from a physics point of view, it is perfectly possible that an observable attains an ergodic regime and then falls back to a non-ergodic regime.

Appendix: Code

The Bernoulli trial example we discussed is available as a Python notebook on GitHub here

Friday, 11 February 2022

Physics origins of the most important statistical ideas of recent times

 Figure: Maxwell's handwritings, state diagram (Wikipedia)

Preamble

Modern statistics is now moving into an emerging field called data science, which amalgamates many different fields, from high-performance computing to control engineering. However, researchers in machine learning and statistics sometimes omit, naïvely and probably unknowingly, the fact that some of the most important ideas in data science actually originated from physics discoveries and were specifically developed by physicists. In this short exposition we review these physics origins for the areas identified by Gelman and Vehtari (doi). An additional section covers other areas that are currently the focus of active research in data science.

Bootstrapping and simulation based inference : Gibbs's Ensemble theory and Metropolis's simulations

Bootstrapping is a novel idea for producing estimates with uncertainty from a given set of samples. It was mostly popularised by Efron, and his contribution is immense, making this tool available to all researchers doing quantitative analysis. However, the origins of bootstrapping can be traced back to the idea of ensembles in statistical physics, introduced by J. Gibbs. Ensembles in physics allow us to do just what bootstrapping helps with: estimating a quantity of interest by sub-sampling, which in statistical physics appears as sampling a set of different microstates. Using this idea, Metropolis devised an inference scheme in 1953 to compute ensemble averages for liquids using computers. Note that the usage of the Monte Carlo approach for purely mathematical purposes, i.e., solving integrals, appears much earlier, with von Neumann's efforts.
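To make the analogy concrete, a minimal bootstrap sketch, resampling a given sample the way one samples microstates of an ensemble, could look like the following; the data and the percentile-interval choice are illustrative assumptions.

```python
import random

random.seed(0)  # reproducible Mersenne Twister stream

def bootstrap_mean_ci(sample, n_resamples=2000):
    """Percentile bootstrap for the mean: resample with replacement
    (analogous to drawing microstates) and collect the statistic."""
    means = []
    for _ in range(n_resamples):
        resample = [random.choice(sample) for _ in sample]
        means.append(sum(resample) / len(resample))
    means.sort()
    lo = means[int(0.025 * n_resamples)]
    hi = means[int(0.975 * n_resamples)]
    return lo, hi

# Illustrative data: 100 draws from a Gaussian with mean 10.
sample = [random.gauss(10.0, 2.0) for _ in range(100)]
lo, hi = bootstrap_mean_ci(sample)  # 95% interval around the sample mean
```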

Causality : Hamiltonian systems to Thermodynamic potentials

 Figure: Maxwell Relations as causal diagrams.
Even though the historical roots of causal analysis in the early 20th century are attributed to Wright (1923) for his definition of path analysis, causality was a core tenet of Newtonian mechanics in distinguishing the left and right sides of the equations of motion in the form of differential equations; the set of differential equations that followed with Hamiltonian mechanics actually forms a graph, i.e., relationships between generalised coordinates, momenta and positions. This connection was never acknowledged in the early statistical literature; probably the causal constructions from classical physics were not well known in that community or did not find their way into data-driven mechanics. Similarly, the causal construction of thermodynamic potentials appears as a directed graph, as in the Born wheel. It appears as a mnemonic, but it is actually causally constructed via Legendre transformations. Of course, causality, philosophically speaking, has been discussed since Ancient Greece, but here we restrict the discussion solely to quantitative theories after Newton.

Overparametrised models and regularisation : Poincaré classifications and astrophysical dynamics

Current deep learning systems are classified as massively overparametrised systems. However, the lower-dimensional understanding of this phenomenon was well studied in Poincaré's classification of classical dynamics, namely the measurement problem of having an overdetermined system of differential equations; such inverse problems are well known in astrophysics and theoretical mechanics.

High-performance computing: Big-data to GPUs

Similarly, using supercomputers, or as we now call it, high-performance computing, with big-data-generating processes can actually be traced back to the Manhattan Project and ENIAC, which aimed at solving scattering equations, and to almost 50 years of development in this direction before the 2000s.

Conclusion

The impressive development of the new emergent field of data science, as a larger perspective of statistics merging into computer science, has strong origins in the core physics literature and research. These connections are not sufficiently cited or acknowledged. Our aim in this short exposition is to bring these aspects to the attention of data science practitioners and researchers alike.


Appendix: Pearson correlation and Lattices

Auguste Bravais is famous for his foundational work on the mathematical theory of crystallography, which now seems to go far beyond periodic solids. Unknown to many, he actually first derived the expression for what we know today as the correlation coefficient, Pearson's correlation, or less commonly the Pearson-Galton coefficient. Interestingly, one of the grandfathers of causal analysis, Wright, mentioned this in his seminal 1921 work titled "Correlation and causation", acknowledging Bravais's 1849 work as the first derivation of correlation.

Monday, 15 November 2021

Periodic Spectral Ergodicity Accurately Predicts Deep Learning Generalisation

Preamble

One of the new mathematical concepts arising from the effort to understand deep learning is called periodic spectral ergodicity (PSE). Cascading PSE (cPSE) propagates over deep learning layers and can be used as a complexity measure. cPSE can actually also predict generalisation ability. In this post, we review this interesting finding in an easy and short manner.

How periodic spectral ergodicity cascades over layers

 Figure: Replicate eigenvalue bins of the smaller layer matrices.
We have reviewed spectral ergodicity in a gentle fashion earlier, here. The only difference is that in real deep learning architectures the lengths of the eigenvalue spectra, i.e., the numbers of bins in the histograms, generated by the weight matrices are not equal. To align them, we use something called periodic boundary conditions, i.e., we repeat the eigenvalues in a cyclic fashion up to the maximum spectrum length seen up to that layer. Here are the steps that give the intuition of how to compute cascading periodic spectral ergodicity (cPSE).

1. Compute the eigenvalue spectra up to layer $i$ and align the smaller spectra using periodic boundary conditions, i.e., cyclically.

2. Compute the spectral ergodicity at layers $i$ and $i-1$.

3. Compute the cascading PSE at layer $i$ simply via a distance metric between $\Omega^{i}$ and $\Omega^{i-1}$, i.e., KL divergence in two directions; recall the earlier tutorials.

If we repeat this up to the last layer, cPSE measures the complexity of the deep learning architecture in a depth-wise fashion, capturing both the structure and the learning algorithm.
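The cyclic alignment and two-directional KL steps above can be sketched schematically; the exact cPSE definition lives in the paper and its accompanying code, so the function names and normalisation below are illustrative assumptions, not the reference implementation.

```python
import math

def cyclic_align(spectrum, length):
    """Periodic boundary conditions: repeat a shorter eigenvalue spectrum
    cyclically up to the maximum spectrum length seen so far."""
    return [spectrum[i % len(spectrum)] for i in range(length)]

def symmetric_kl(p, q, eps=1e-12):
    """KL divergence taken in two directions between normalised histograms."""
    kl = lambda a, b: sum(ai * math.log((ai + eps) / (bi + eps))
                          for ai, bi in zip(a, b))
    return kl(p, q) + kl(q, p)

def normalise(hist):
    total = sum(hist)
    return [h / total for h in hist]

# Illustrative: align two layer spectra of unequal length, then compare.
spectrum_prev = [0.5, 1.0, 2.0, 4.0, 1.5, 0.8]
spectrum_curr = [0.7, 1.2, 3.0]
length = max(len(spectrum_prev), len(spectrum_curr))
p = normalise(cyclic_align(spectrum_prev, length))
q = normalise(cyclic_align(spectrum_curr, length))
distance = symmetric_kl(p, q)  # non-negative; zero only for identical spectra
```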

Generalisation Gap and cPSE

Apart from being a complexity measure, cPSE predicts the generalisation gap given a reference architecture, i.e., it correlates with performance almost perfectly. These findings are presented in the paper suzen2019.

Conclusions and Outlook

The complexity of deep learning architectures is still an open research problem. One of the most promising directions is to use cPSE, as it captures structural complexity as well. Other measures in the literature did not consider depth dependency; cPSE appears to be the first one that does.

Reference

@article{suzen2019,
title={Periodic Spectral Ergodicity: A Complexity Measure for Deep Neural Networks and Neural Architecture Search},
author={S{\"u}zen, Mehmet and Cerd{\`a}, Joan J and Weber, Cornelius},
journal={arXiv preprint arXiv:1911.07831},
year={2019}
}

Cite this post as  Periodic Spectral Ergodicity Accurately Predicts Deep Learning Generalisation, Mehmet Süzen,  https://science-memo.blogspot.com/2021/11/periodic-spectral-ergodicity-predicts-generalisation-deep-learning.html 2021