Monday, 20 June 2022

Empirical risk minimization is not learning : A mathematical definition of learning and re-understanding of overfitting and Occam's razor in machine learning

Figure: Simionescu function (Wikipedia)

Preamble

The holy grail of machine learning appears to be empirical risk minimisation. However, contrary to the general dogma, the primary objective of machine learning is not risk minimisation per se but mimicking human or animal learning. Empirical risk minimisation is just a snapshot in this direction and is part of a learning measure, not the primary objective.

Unfortunately, all current major machine learning libraries implement empirical risk minimisation as the primary objective, the so-called training, usually manifested as a .fit call. Here we provide a mathematical definition of learning in the language of empirical risk minimisation and discuss its implications for two very important concepts, overfitting and Occam's razor.

Our exposition is still informal but it should be readable for experienced practitioners.

Definition: Empirical Risk Minimization

Given a set of $k$ observations $\mathscr{O} = \{o_{1}, \dots, o_{k}\}$, where each $o_{i} \in \mathbb{R}^{n}$ is an $n$-dimensional vector, the corresponding labels or binary classes form the set $\mathscr{S} = \{s_{1}, \dots, s_{k}\}$ with $s_{i} \in \{0,1\}$. A function $g$ maps observations to classes, $g: \mathscr{O} \to \mathscr{S}$. An error function (or loss) $E$ measures the error made by the estimated map $\hat{g}$ compared to the true map $g$, $E = E(\hat{g}, g)$. The entire idea of supervised machine learning boils down to minimising a functional called the ER (Empirical Risk), denoted here by $G$. It is a functional, meaning a function of a function, taken over the domain $\mathscr{D} = Tr(\mathscr{O} \times \mathscr{S})$ in discrete form, $$G[E] = \frac{1}{k} \sum_{\mathscr{D}} E(\hat{g}, g).$$ This is the so-called training of a machine learning model, or an estimation of $\hat{g}$. However, testing this estimate on new data is not the main purpose of learning.
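As a minimal sketch of the discrete form of $G$ above, the snippet below averages a per-observation loss over a small labelled set. The 0-1 loss, the names `empirical_risk` and `threshold_classifier`, and the toy data are assumptions made purely for illustration; they are not part of the definition.

```python
from typing import Callable, Sequence

Observation = Sequence[float]


def zero_one_loss(predicted: int, actual: int) -> float:
    """Assumed loss E for binary classes: 1 for a mistake, 0 otherwise."""
    return 0.0 if predicted == actual else 1.0


def empirical_risk(
    g_hat: Callable[[Observation], int],
    observations: Sequence[Observation],
    labels: Sequence[int],
    loss: Callable[[int, int], float] = zero_one_loss,
) -> float:
    """Average loss of g_hat over the k labelled observations."""
    if len(observations) != len(labels) or len(observations) == 0:
        raise ValueError("need equally many observations and labels, at least one")
    total = sum(loss(g_hat(o), s) for o, s in zip(observations, labels))
    return total / len(observations)


def threshold_classifier(o: Observation) -> int:
    """Hypothetical estimated map g_hat: thresholds the first coordinate."""
    return 1 if o[0] > 0.5 else 0


# Toy data, invented for this example only.
observations = [(0.1, 1.0), (0.4, 0.2), (0.6, 0.3), (0.9, 0.8)]
labels = [0, 1, 1, 1]

risk = empirical_risk(threshold_classifier, observations, labels)
print(risk)  # one mistake out of four observations: 0.25
```

The loss is passed in as a parameter so that the same averaging applies to any choice of $E$.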

Definition: Learning measure

A learning measure $M$ on $\hat{g}$ is defined over a set of $l$ observation sets of increasing size, $\Theta = \{\mathscr{O}_{1}, \dots, \mathscr{O}_{l}\}$, whereby the size of each set is strictly larger than that of the previous one, meaning that $|\mathscr{O}_{1}| < |\mathscr{O}_{2}| < \dots < |\mathscr{O}_{l}|$.
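One way to build such a family $\Theta$ in practice is sketched below. It assumes the sets are nested prefixes of a single shuffled dataset; the definition only requires strictly increasing sizes, so nesting is an extra assumption, and the helper name `increasing_subsets` is hypothetical.

```python
import random
from typing import List, Sequence, Tuple


def increasing_subsets(
    observations: Sequence[Sequence[float]],
    labels: Sequence[int],
    sizes: Sequence[int],
    seed: int = 0,
) -> List[Tuple[list, list]]:
    """Return (observations, labels) pairs whose sizes strictly increase.

    Assumption: each subset is a prefix of one fixed random ordering,
    so the subsets are nested. The definition itself does not demand this.
    """
    if len(observations) != len(labels):
        raise ValueError("observations and labels must have the same length")
    if not sizes or sizes[0] <= 0:
        raise ValueError("sizes must be non-empty and positive")
    if any(b <= a for a, b in zip(sizes, sizes[1:])):
        raise ValueError("sizes must be strictly increasing")
    if sizes[-1] > len(observations):
        raise ValueError("largest size exceeds the number of observations")

    order = list(range(len(observations)))
    random.Random(seed).shuffle(order)  # fixed seed for reproducibility

    return [
        ([observations[i] for i in order[:m]], [labels[i] for i in order[:m]])
        for m in sizes
    ]


# Toy data, invented for this example only.
observations = [(float(i),) for i in range(10)]
labels = [i % 2 for i in range(10)]

theta = increasing_subsets(observations, labels, sizes=[2, 4, 8])
print([len(obs) for obs, _ in theta])  # [2, 4, 8]
```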

Definition: Empirical Risk Minimization with a learning measure (ERL)

Now we are in a position to reformulate ER with a learning measure; we call this ERL. This comes with a testing procedure.

If the empirical risks $E_{j}$, where $E_{j}$ is the empirical risk obtained on $\mathscr{O}_{j}$, decrease monotonically, $E_{1} > E_{2} > \dots > E_{l}$, then we say the functional form of $\hat{g}$ is a learning over the set $\Theta$.
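The monotonicity condition can be checked directly once the risks have been computed. The sketch below assumes the risks are given in the order of the sets in $\Theta$; the function name `is_learning` and the example values are hypothetical.

```python
from typing import Sequence


def is_learning(risks: Sequence[float]) -> bool:
    """True if E_1 > E_2 > ... > E_l, i.e. the risks strictly decrease.

    At least two risks are required, since a single value
    says nothing about a trend.
    """
    if len(risks) < 2:
        return False
    return all(later < earlier for earlier, later in zip(risks, risks[1:]))


# Illustrative values only, not measured results.
print(is_learning([0.40, 0.30, 0.20]))  # True
print(is_learning([0.40, 0.40, 0.20]))  # False: a plateau breaks strictness
```

Requiring strict inequality follows the definition literally; with noisy estimates one might prefer a tolerance, which would be a separate design choice.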

Functional form of $\hat{g}$ : Inductive bias

The functional form implies a model selection. The technical term for this, together with other assumptions, is inductive bias, meaning the selection of the complexity of the model, for example linear versus nonlinear regression.

Re-understanding of overfitting and Occam's razor from ERL perspective

Suppose we have two different ERLs, on $\hat{g}^{1}$ and $\hat{g}^{2}$. Overfitting is then a comparison problem between the monotonic behaviour of their empirical risks. Comparing the two models, here inductive biases or functional forms, over the learning measure, we select the one with "higher monotonicity" and lower complexity, and call the other the overfitted model. Complexity here boils down to the functional complexity of $\hat{g}^{1}$ and $\hat{g}^{2}$, and overfitting can only be tested with two models over the monotonicity of their ERLs.
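The pairwise comparison can be sketched as follows. "Higher monotonicity" is not quantified in the text, so this example assumes one possible score: the fraction of consecutive steps at which the empirical risk strictly decreases. Complexity is assumed to be a single user-supplied number (for instance a parameter count), used only to break ties. The names `Candidate`, `monotonicity_score` and `select_model`, and all numbers, are hypothetical.

```python
from dataclasses import dataclass
from typing import Sequence, Tuple


@dataclass
class Candidate:
    name: str
    risks: Sequence[float]  # empirical risks over the sets in Theta, in order
    complexity: int         # assumed proxy, e.g. number of parameters


def monotonicity_score(risks: Sequence[float]) -> float:
    """Assumed score: share of consecutive steps where the risk strictly drops."""
    steps = list(zip(risks, risks[1:]))
    if not steps:
        raise ValueError("need at least two risks")
    return sum(1 for earlier, later in steps if later < earlier) / len(steps)


def select_model(first: Candidate, second: Candidate) -> Tuple[Candidate, Candidate]:
    """Return (preferred, overfitted) under the assumptions stated above.

    Higher monotonicity wins; on a tie, the less complex model wins.
    """
    preferred, other = sorted(
        [first, second],
        key=lambda c: (-monotonicity_score(c.risks), c.complexity),
    )
    return preferred, other


# Illustrative values only, not measured results.
linear = Candidate("linear", [0.40, 0.35, 0.30, 0.28], complexity=3)
polynomial = Candidate("polynomial", [0.30, 0.33, 0.25, 0.27], complexity=12)

preferred, overfitted = select_model(linear, polynomial)
print(preferred.name, overfitted.name)  # linear polynomial
```

Note that the polynomial candidate reaches a lower risk on some sets yet is not selected, because under this score the comparison is about the trend over $\Theta$ rather than any single risk value.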

Conclusions

In the age of deep learning systems, classical learning theory needs an update on how we define what learning is beyond a single-shot fitting exercise. A first step in this direction would be to improve upon the basic definition of Empirical Risk (ER) minimisation so that it reflects real-life learning systems, similar to the forgetting mechanism proposed by Ebbinghaus. This is consistent with Tom Mitchell's definition of operational machine learning. A next level would be to add causality to the definition.

Postscript: Understanding overfitting as comparison of inductive biases

ERM can be confusing even for experienced researchers. It is indeed about a risk measure. We measure the risk of a model, i.e., of a machine learning procedure: how much error it would make on a given new data distribution, as in the risk of investing. This notion is quite similar to the financial risk of loss, although this is not explicitly stated.

Moreover, the primary objective of machine learning is not ERM but measuring learning curves and pair-wise comparison of inductive biases, thereby avoiding overfitting. An inductive bias, a concept we restrict here to the model type, is a model selection step: different parametrisations of the same model are still the same inductive bias. That’s why standard training-error learning curves alone can’t be used to detect overfitting.

Postscript: Learning is not to optimise: Thermodynamic limit, true risk and accessible learning space

True risk minimisation in machine learning is not possible; instead we rely on ERM, i.e., Empirical Risk Minimisation. However, the purpose of a machine learning algorithm is not to minimise risk, as we only have partial knowledge about reality through data. Learning implies finding a region in the accessible learning space whereby there is a monotonic increase in the objective; ERM is only a single point in this space. The concept is rooted in the German scientist Hermann Ebbinghaus's work on memory.

There is an intimate connection between the thermodynamic limit and true risk in this direction, which remains open research. However, it doesn’t imply an infinite limit of data, but rather the observable’s behaviour. That’s why fully empiricist approaches usually require a complement of physical laws, such as Physics Informed Neural Networks (PINNs) or Structural Causal Models (SCMs).