Simionescu Function (Wikipedia) |

Preamble

Preamble

*empirical risk minimisation*. However, on the contrary to general dogma, the primary objective of machine learning is not

*risk minimisation per se*but mimicking human or animal learning. Empirical risk minimisation is just a snap-shot in this direction and is part of a learning measure, not the primary objective.

**Definition: Empirical Risk Minimization**

Given set of $k$ observation $\mathscr{O} = \{o_{1}, ..., o_{k} \}$ where $o_{i} \in \mathbb{R}^{n}$, $n$-dimensional vectors. Corresponding labels or binary classes, the set $\mathscr{S} = \{ s_{1}, .., s_{k}\}$, with $s_{i} \in \{0,1\}$ is defined. A function $g$ maps observations to classes $g: \mathscr{O} \to \mathscr{S}$. An error function (or loss) $E$ measures the error made by the estimated map function $\hat{g}$ compare to true map function $g$, $E=E(\hat{g}, g)$. The entire idea of supervised machine learning boils down to minimising a functional called ER (Empirical Risk), here we denoted by $G$, it is a functional, meaning is a function of function, over the domain $\mathscr{D} = Tr(\mathscr{O} x \mathscr{S})$ in discrete form, $$ G[E] = \frac{1}{k} {\Large \Sigma}_{\mathscr{D} } E(\hat{g}, g) $$. This is so called a training a machine learning model, or an estimation for $\hat{g}$. However, testing this estimate on the new data is *not* the main purpose of the learning.

**Definition: Learning measure **

A learning measures $M$, on $\hat{g}$ is defined over set of $l$ observations with increasing size, $\Theta = \{ \mathscr{O}_{1}, ..., \mathscr{O}_{l}\}$ whereby size of each set is monotonically higher, meaning that $ | \mathscr{O}_{1}| < | \mathscr{O}_{2}| , ...,< | \mathscr{O}_{l}|$.

**Definition: Empirical Risk Minimization with a learning measure (ERL)**

Now, we are in a position to reformulate ER with learning measure, we call this ERL. This come with a testing procedure.

If empirical risks $E_{j}$ lowers monotonically, $ E_{1} > E_{2} > ... > E_{l}$, then we said the functional form of $\hat{g}$ is a learning over the set $\Theta$.

**Functional form of $\hat{g}$ : Inductive bias**

The functional form implies a model selection, and a technical term of this also known as inductive bias with other assumptions, meaning the selection of complexity of the model, for example a linear regression or nonlinear regression.

**Re-understanding of overfitting and Occam's razor from ERL perspective **

If we have two different ERLs on $\hat{g}^{1}$ and $\hat{g}^{2}$. Then overfitting is a comparison problem between monotonically increasing empirical risks. If model, here an inductive bias or a functional form, over learning measure, we select the one with "higher monotonicity" and the less complex one and call the other overfitted model. Complexity here boils down to functional complexity of $\hat{g}^{1}$ and $\hat{g}^{2}$ and overfitting can only be tested with two models over monotonicity (increasing) of ERLs.

**Conclusions**

In the age of deep learning systems, the classical learning theory needs an update on how do we define what is learning beyond a single shot fitting exercise. A first step in this direction would be to improve upon basic definitions of Empirical Risk (ER) minimisation that would reflect real-life learning systems similar to forgetting mechanism proposed by Ebbinghaus. This is consistent with Tom Mitchell's definition of operational machine learning. A next level would be to add causality in the definition.

**Postscript: Understanding overfitting as comparison of inductive biases**

ERM could be confusing for even experienced researchers. It is indeed about risk measure. We measure the risk of a model, i.e., machine learning procedure that how much error would it make on the given new data distribution, as in risk of investing. This is quite a similar notion as in financial risk of loss but not explicitly stated.

Moreover, a primary objective of machine learning is not ERM but measure learning curves and pair-wise comparison of inductive biases, avoiding overfitting. An inductive bias, here we restrict the concept as in model type, is a model selection step: different parametrisation of the same model are still the same inductive bias. That’s why standard training-error learning curves can’t be used to detect overfitting alone.

**Postscript Learning is not to optimise : Thermodynamic limit, true risk and accessible learning space**

True risk minimisation in machine learning is not possible, instead we rely on ERM, i.e., Emprical Risk Minimisation. However, the purpose of machine learning algorithm is not to minimise risk, as we only have a partial knowledge about the reality through data. Learning implies finding out a region in accessible learning space whereby there is a monotonic increase in the objective; ERM is only a single point on this space, the concept rooted in German scientist Hermann Ebbinghaus work on memory.

There is an intimate connection to thermodynamic limit and true risk in this direction as an open research. However, it doesn’t imply infinite limit of data, but the observable’s behaviour. That’s why full empiricist approaches usually requires a complement of a physical laws, such as Physics Informed Neural Networks (PINNs) or Structural Causal Model (SCM).