Monday 20 June 2022

Empirical risk minimization is not learning:
A mathematical definition of learning and re-understanding of overfitting and Occam's razor in machine learning

    Figure: Simionescu function (Wikipedia)

Preamble

The holy grail of machine learning appears to be empirical risk minimisation. However, contrary to the general dogma, the primary objective of machine learning is not risk minimisation per se but mimicking human or animal learning. Empirical risk minimisation is just a snapshot in this direction and is part of a learning measure, not the primary objective.

Unfortunately, all current major machine learning libraries implement empirical risk minimisation as the primary objective, the so-called training, usually manifested as a .fit method. Here we provide a mathematical definition of learning in the language of empirical risk minimisation and discuss its implications for two very important concepts, overfitting and Occam's razor.

Our exposition is still informal but it should be readable for experienced practitioners.

Definition: Empirical Risk Minimization

Given a set of $k$ observations $\mathscr{O} = \{o_{1}, ..., o_{k}\}$, where $o_{i} \in \mathbb{R}^{n}$ are $n$-dimensional vectors, the corresponding labels or binary classes form the set $\mathscr{S} = \{s_{1}, ..., s_{k}\}$ with $s_{i} \in \{0,1\}$. A function $g$ maps observations to classes, $g: \mathscr{O} \to \mathscr{S}$. An error function (or loss) $E$ measures the error made by the estimated map $\hat{g}$ compared to the true map $g$, $E = E(\hat{g}, g)$. The entire idea of supervised machine learning boils down to minimising a functional called the ER (Empirical Risk), denoted here by $G$; it is a functional, meaning a function of a function, over the domain of training pairs $\mathscr{D} = \mathrm{Tr}(\mathscr{O} \times \mathscr{S})$, in discrete form $$ G[E] = \frac{1}{k} \sum_{\mathscr{D}} E(\hat{g}, g). $$ This is the so-called training of a machine learning model, i.e., the estimation of $\hat{g}$. However, testing this estimate on new data is not the main purpose of learning.
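As a concrete illustration (a minimal sketch, not part of the original definition), the following Python snippet computes $G[E]$ with a 0-1 loss for a hand-picked estimate $\hat{g}$; the toy data, the threshold classifier and the function names are assumptions made purely to make the notation tangible.

    import numpy as np

    def empirical_risk(hat_g, observations, labels):
        # G[E] = (1/k) * sum over D of E(hat_g, g), with E the 0-1 loss
        predictions = np.array([hat_g(o) for o in observations])
        return np.mean(predictions != labels)

    # toy data: 1-dimensional observations, the true map g is a sign threshold at 0
    rng = np.random.default_rng(42)
    observations = rng.normal(size=(100, 1))
    labels = (observations[:, 0] > 0.0).astype(int)

    # an estimated map hat_g with a slightly wrong threshold
    hat_g = lambda o: int(o[0] > 0.1)
    print(empirical_risk(hat_g, observations, labels))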

Definition: Learning measure 

A learning measure $M$ on $\hat{g}$ is defined over a set of $l$ observation sets of increasing size, $\Theta = \{\mathscr{O}_{1}, ..., \mathscr{O}_{l}\}$, whereby the size of each set is monotonically larger, meaning that $|\mathscr{O}_{1}| < |\mathscr{O}_{2}| < ... < |\mathscr{O}_{l}|$.

Definition: Empirical Risk Minimization with a learning measure (ERL)

Now we are in a position to reformulate ER with a learning measure; we call this ERL. This comes with a testing procedure.

If the empirical risks $G[E_{j}]$ decrease monotonically, $G[E_{1}] > G[E_{2}] > ... > G[E_{l}]$, then we say the functional form of $\hat{g}$ is learning over the set $\Theta$.
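A minimal sketch of this testing procedure could look as follows, under the assumption that the empirical risks are evaluated on a common held-out set (one possible reading of the definition); the estimator, the subset sizes and the synthetic data below are placeholders, not a prescription.

    import numpy as np
    from sklearn.linear_model import LogisticRegression

    def erl_risks(make_model, X, y, sizes, X_eval, y_eval):
        # empirical risk (misclassification rate) of the same functional form
        # fitted on nested observation sets O_1, ..., O_l of increasing size
        risks = []
        for k in sizes:
            model = make_model().fit(X[:k], y[:k])
            risks.append(np.mean(model.predict(X_eval) != y_eval))
        return risks

    def is_learning(risks):
        # ERL condition: G[E_1] > G[E_2] > ... > G[E_l]
        return all(r1 > r2 for r1, r2 in zip(risks, risks[1:]))

    # synthetic usage sketch
    rng = np.random.default_rng(0)
    X = rng.normal(size=(2000, 5))
    y = (X @ np.ones(5) + 0.5 * rng.normal(size=2000) > 0).astype(int)
    risks = erl_risks(lambda: LogisticRegression(max_iter=1000),
                      X[:1000], y[:1000], [50, 100, 200, 400, 800],
                      X[1000:], y[1000:])
    print(risks, is_learning(risks))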

Functional form of $\hat{g}$: Inductive bias

The functional form implies a model selection; the technical term for this, together with other assumptions, is the inductive bias, meaning the selection of the model's complexity, for example a linear regression versus a nonlinear regression.

Re-understanding of overfitting and Occam's razor from ERL perspective 

Suppose we have two different ERLs, on $\hat{g}^{1}$ and $\hat{g}^{2}$. Overfitting is then a comparison problem between monotonically decreasing empirical risks, i.e., between learning curves. Given two models, here two inductive biases or functional forms, evaluated over the same learning measure, we select the one with "stronger monotonicity" and lower complexity, and call the other the overfitted model. Complexity here boils down to the functional complexity of $\hat{g}^{1}$ and $\hat{g}^{2}$, and overfitting can only be tested with two models over the monotonicity of their ERLs.
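The pairwise test can be sketched as follows (an assumed illustration, not from the original post): two inductive biases, a linear fit for $\hat{g}^{1}$ and a degree-9 polynomial for $\hat{g}^{2}$, are run over the same learning measure, and the less complex model among those with monotonically decreasing risks would be preferred, the other labelled as overfitted.

    import numpy as np
    from numpy.polynomial import polynomial as P

    def erl_risks(degree, sizes, X_train, y_train, X_eval, y_eval):
        # mean-squared empirical risk of a fixed functional form (polynomial of
        # a given degree) fitted on observation sets of increasing size
        risks = []
        for k in sizes:
            coeffs = P.polyfit(X_train[:k], y_train[:k], degree)
            risks.append(np.mean((P.polyval(X_eval, coeffs) - y_eval) ** 2))
        return risks

    rng = np.random.default_rng(1)
    X = rng.uniform(-1.0, 1.0, 400)
    y = np.sin(np.pi * X) + 0.1 * rng.normal(size=400)
    sizes = [20, 40, 80, 160, 200]

    risks_linear = erl_risks(1, sizes, X[:200], y[:200], X[200:], y[200:])  # hat_g^1
    risks_poly9 = erl_risks(9, sizes, X[:200], y[:200], X[200:], y[200:])   # hat_g^2
    # select the less complex model among those whose risks decrease monotonically;
    # the remaining one is called the overfitted model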

Conclusions

In the age of deep learning systems, classical learning theory needs an update on how we define learning beyond a single-shot fitting exercise. A first step in this direction would be to improve upon the basic definition of Empirical Risk (ER) minimisation so that it reflects real-life learning systems, similar to the forgetting mechanism proposed by Ebbinghaus. This is consistent with Tom Mitchell's operational definition of machine learning. A next level would be to add causality to the definition.

Please cite as follows:

 @misc{suezen22erm,
     title = {Empirical risk minimization is not learning: A mathematical definition of learning and re-understanding of overfitting and Occam's razor in machine learning},
     howpublished = {\url{http://science-memo.blogspot.com/2022/06/empirical-risk-minimisation-learning-curve.html}},
     author = {Mehmet Süzen},
     year = {2022}
 }

Postscript Notes

The following notes were added after the initial release.

Postscript 1: Understanding overfitting as comparison of inductive biases

ERM can be confusing even for experienced researchers. It is indeed about a risk measure: we measure the risk of a model, i.e., of a machine learning procedure, as the amount of error it would make on new data from the given distribution, as in the risk of an investment. This notion is quite similar to the financial risk of loss, though that is not usually stated explicitly.


Moreover, the primary objective of machine learning is not ERM but measuring learning curves and making pairwise comparisons of inductive biases, thereby avoiding overfitting. An inductive bias, a concept we restrict here to the model type, is a model selection step: different parametrisations of the same model are still the same inductive bias. That's why standard training-error learning curves alone can't be used to detect overfitting.

Postscript 2: Learning is not optimisation: Thermodynamic limit, true risk and accessible learning space

True risk minimisation in machine learning is not possible; instead we rely on ERM, i.e., Empirical Risk Minimisation. However, the purpose of a machine learning algorithm is not to minimise risk, as we only have partial knowledge about reality through data. Learning implies finding a region in the accessible learning space where there is a monotonic improvement in the objective; ERM is only a single point in this space. The concept is rooted in the German scientist Hermann Ebbinghaus's work on memory.


There is an intimate connection to the thermodynamic limit and the true risk in this direction, which remains an open research question. However, the thermodynamic limit does not imply an infinite amount of data, but rather concerns the behaviour of observables. That's why fully empiricist approaches usually require a complement of physical laws, such as Physics-Informed Neural Networks (PINNs) or Structural Causal Models (SCMs).

Postscript 3: Missing abstraction in modern machine learning libraries 

Interestingly, current modern machine learning libraries stop abstracting at fitting: .fit and .predict. This falls short of learning as in machine learning; learning manifests itself in learning curves. A .learn functionality could be leveraged beyond fitting, telling us whether we are learning via monotonically increasing performance. The origin of this lack of tooling for .learn appears to be how Empirical Risk Minimisation (ERM) is formulated on a single task.
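As a sketch of what such a missing abstraction might look like, one could imagine a hypothetical .learn wrapper on top of a scikit-learn style estimator that produces a learning curve over observation sets of increasing size and reports whether the ERL condition holds; the Learner class and its interface below are an assumption for illustration, not an existing API.

    import numpy as np
    from sklearn.base import clone
    from sklearn.linear_model import LogisticRegression

    class Learner:
        # hypothetical wrapper adding a .learn step on top of .fit/.predict
        def __init__(self, estimator, sizes):
            self.estimator = estimator
            self.sizes = sizes

        def learn(self, X, y, X_eval, y_eval):
            # fit the same inductive bias on observation sets of increasing size
            # and check for monotonically improving held-out performance
            scores = []
            for k in self.sizes:
                model = clone(self.estimator).fit(X[:k], y[:k])
                scores.append(model.score(X_eval, y_eval))
            is_learning = all(s1 < s2 for s1, s2 in zip(scores, scores[1:]))
            return scores, is_learning

    # usage sketch with synthetic data
    rng = np.random.default_rng(7)
    X = rng.normal(size=(1200, 3))
    y = (X.sum(axis=1) > 0).astype(int)
    learner = Learner(LogisticRegression(max_iter=1000), sizes=[30, 100, 300, 600])
    print(learner.learn(X[:600], y[:600], X[600:], y[600:]))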