Scientific Memo

Showing posts with label optimization. Show all posts

Monday, 20 June 2022

Empirical risk minimization is not learning :
A mathematical definition of learning and re-understanding of overfitting and Occam's razor in machine learning

Simionescu Function (Wikipedia)

Preamble

The holy grail of machine learning appears to be the empirical risk minimisation. However, on the contrary to general dogma, the primary objective of machine learning is not risk minimisation per se but mimicking human or animal learning. Empirical risk minimisation is just a snap-shot in this direction and is part of a learning measure, not the primary objective.

Unfortunately, all current major machine learning libraries are implementing empirical risk minimisation as primary objective, so called a training, manifest as usually .fit. Here we provide a mathematical definition of learning in the language of empirical risk minimisation and its implications on two very important concepts, overfitting and Occam's razor.

Our exposition is still informal but it should be readable for experienced practitioners.

Definition: Empirical Risk Minimization

Given set of $k$ observation $\mathscr{O} = \{o_{1}, ..., o_{k} \}$ where $o_{i} \in \mathbb{R}^{n}$, $n$-dimensional vectors. Corresponding labels or binary classes, the set $\mathscr{S} = \{ s_{1}, .., s_{k}\}$, with $s_{i} \in \{0,1\}$ is defined. A function $g$ maps observations to classes $g: \mathscr{O} \to \mathscr{S}$. An error function (or loss) $E$ measures the error made by the estimated map function $\hat{g}$ compare to true map function $g$, $E=E(\hat{g}, g)$. The entire idea of supervised machine learning boils down to minimising a functional called ER (Empirical Risk), here we denoted by $G$, it is a functional, meaning is a function of function, over the domain $\mathscr{D} = Tr(\mathscr{O} x \mathscr{S})$ in discrete form, $$ G[E] = \frac{1}{k} {\Large \Sigma}_{\mathscr{D} } E(\hat{g}, g) $$. This is so called a training a machine learning model, or an estimation for $\hat{g}$. However, testing this estimate on the new data is not the main purpose of the learning.

Definition: Learning measure

A learning measures $M$, on $\hat{g}$ is defined over set of $l$ observations with increasing size, $\Theta = \{ \mathscr{O}_{1}, ..., \mathscr{O}_{l}\}$ whereby size of each set is monotonically higher, meaning that $ | \mathscr{O}_{1}| < | \mathscr{O}_{2}| , ...,< | \mathscr{O}_{l}|$.

Definition: Empirical Risk Minimization with a learning measure (ERL)

Now, we are in a position to reformulate ER with learning measure, we call this ERL. This come with a testing procedure.

If empirical risks $G[E_{j}]$ lowers monotonically, $ G[E_{1}] > G[E_{2}] > ... > G[E_{l}]$, then we said the functional form of $\hat{g}$ is a learning over the set $\Theta$.

Functional form of $\hat{g}$ : Inductive bias

The functional form implies a model selection, and a technical term of this also known as inductive bias with other assumptions, meaning the selection of complexity of the model, for example a linear regression or nonlinear regression.

Re-understanding of overfitting and Occam's razor from ERL perspective

If we have two different ERLs on $\hat{g}^{1}$ and $\hat{g}^{2}$. Then overfitting is a comparison problem between monotonically increasing empirical risks. If model, here an inductive bias or a functional form, over learning measure, we select the one with "higher monotonicity" and the less complex one and call the other overfitted model. Complexity here boils down to functional complexity of $\hat{g}^{1}$ and $\hat{g}^{2}$ and overfitting can only be tested with two models over monotonicity (increasing) of ERLs.

Conclusions

In the age of deep learning systems, the classical learning theory needs an update on how do we define what is learning beyond a single shot fitting exercise. A first step in this direction would be to improve upon basic definitions of Empirical Risk (ER) minimisation that would reflect real-life learning systems similar to forgetting mechanism proposed by Ebbinghaus. This is consistent with Tom Mitchell's definition of operational machine learning. A next level would be to add causality in the definition.

Please cite as follows:

@misc{suezen22erm,

title = { Empirical risk minimization is not learning : A mathematical definition of learning and re-understanding of overfitting and Occam's razor in machine learning},

howpublished = {\url{http://science-memo.blogspot.com/2022/06/empirical-risk-minimisation-learning-curve.html}},

author = {Mehmet Süzen},

year = {2022}

}

Postscript Notes

Following notes are added after initial release

Postscript 1: Understanding overfitting as comparison of inductive biases

ERM could be confusing for even experienced researchers. It is indeed about risk measure. We measure the risk of a model, i.e., machine learning procedure that how much error would it make on the given new data distribution, as in risk of investing. This is quite a similar notion as in financial risk of loss but not explicitly stated.

Moreover, a primary objective of machine learning is not ERM but measure learning curves and pair-wise comparison of inductive biases, avoiding overfitting. An inductive bias, here we restrict the concept as in model type, is a model selection step: different parametrisation of the same model are still the same inductive bias. That’s why standard training-error learning curves can’t be used to detect overfitting alone.

Postscript 2: Learning is not to optimise: Thermodynamic limit, true risk and accessible learning space

True risk minimisation in machine learning is not possible, instead we rely on ERM, i.e., Emprical Risk Minimisation. However, the purpose of machine learning algorithm is not to minimise risk, as we only have a partial knowledge about the reality through data. Learning implies finding out a region in accessible learning space whereby there is a monotonic increase in the objective; ERM is only a single point on this space, the concept rooted in German scientist Hermann Ebbinghaus work on memory.

There is an intimate connection to thermodynamic limit and true risk in this direction as an open research. However, it doesn’t imply infinite limit of data, but the observable’s behaviour. That’s why full empiricist approaches usually requires a complement of a physical laws, such as Physics Informed Neural Networks (PINNs) or Structural Causal Model (SCM).

Postscript 3: Missing abstraction in modern machine learning libraries

Interestingly current modern machine learning libraries stop abstracting further than fitting: .fit and .predict. This is short of learning as in machine learning. Learning manifest itself In learning curves. .learn functionality can be leveraged beyond fitting and if we are learning via monotonically increasing performance. Origin of this lack of tools for .learn appears to be how Empirical Risk Minimisation (ERM) is formulated on a single task.

Saturday, 20 March 2021

Computable function analogs of natural learning and intelligence may not exist

Optimal learning : Meta-optimization

Many papers directly equate “machine” learning problem, algorithmic learning oppose to human or animal learning, with optimisation problem. Unfortunately, contrary to common belief machine learning is not an optimisation problem. For example, take optimal learning strategy, a replace learning with optimisation and we end up having and absurd terms of optimal optimisation strategy at one point.

Turing machine (Wikipedia)

Sound like practiced machine learning is a meta-optimisation problem, rather than a learning as humans do.

Computable functions to learning

Fundamentally, we do not know how human learning can be mapped into an algorithm or if there are computable function analogs of human learning or if human intelligence and its artificial analog can be represented as Turing computable manner.

Monday, 22 October 2012

Hands on Crash Tutorial for Supervised Learning : Multinomial Logistic Regression (MNL)

If you are a computer scientist, probably you would call a task supervised learning what others might call classification based on data. Data being anything that has a well defined structure and associated classes. We will work on hands on example for multi-class prediction with logistic regression i.e. multinomial logistic regression (MNL). What I meant by multi-class is here that we have $k \in \bf{Z}$ distinct classes for each observation, $k>1$. Actually if you think in terms of link functions from Generalized Linear Models (GLMs), the support of the distribution will tell you the distinction of the nature of the class. In statistics literature classes can be manifest as factors (or categories).

The following pointers would be helpful before you read further. From R perspective, I'd suggest German Rodrigez's notes for background reading. Data mining blog by Will Dwinnell is one of the clear descriptions in this direction for MATLAB. For sure, an authoritative resource on this is the book by Hastie et. al. called elements of statistical learning, you can obtain it from their Stanford page. An other excellent resource from Professor Ripley of Oxford, his books and R packages are known to be de facto standard in the field, particularly here I refer to nnet and applied statistics book.

Recall that a design matrix $\bf{X}$ is noting more then set of observations ($n$) in the rows. Observables ($p$) are placed along columns. The idea of supervised learning is that we train our model, here we choose to be multinomial GLM, against a data set and obtain coefficients of the model (parameters). Let's do this with the simplest possible example. (R codes start with '>' and MATLAB '>>')

Statistical toolbox brings set of functionality to do multinomial logistic regression. mnrfit is the main function we would like to use. Lets use a simple data set, it is trivial and the statistics we would get from this data might be very poor, but we aimed at the concept in this post.

>> X = [0.0 0.1 0.7 1.0 1.1 1.3 1.4 1.7 2.1 2.2]'; % design matrix
>> Y = [1 2 1 3 1 2 1 3 1 1]'; % associated classes

We can split this set to training and validation data set by choosing

>> trainIndex = [2 8 10 4 5];
>> validIndex = [1 3 6 7 9];

Now we can obtain the coefficients via mnrfit.

>> betaHat    = mnrfit(X(trainIndex), Y(trainIndex), 'model', 'ordinal', 'interactions', 'off', 'link', 'logit');
Note that, model is ordinal, meaning that it takes discrete values. Switched off interactions generates only one coefficient per class (betaHat vector). Then we can get the probabilities for the validation set.

>> predictProbs=mnrval(betaHat, X(validIndex), 'model', 'ordinal', 'interactions', 'off', 'link', 'logit');

ans =

    0.2980    0.1958    0.5061
    0.3636    0.2041    0.4324
    0.4242    0.2045    0.3713
    0.4346    0.2039    0.3615
    0.5084    0.1954    0.2961

So the highest probabilites form this matrix will give us [3, 3, 1, 1, 1] predictive set. With this approach we have predicted last two classes correctly.

Let's do this same example with R, even though the following approach may not be exactly the same procedure described above, however this is multinomial as well:
> require(nnet)
> # Design Matrix: 10 observations
> X            = matrix(c(0.0, 0.1, 0.7, 1.0, 1.1, 1.3, 1.4, 1.7, 2.1, 2.2));
> # Corresponding Classes: 3 ordinal classes
> Y            = matrix(c(0, 1, 0, 0,
                           0, 0, 1, 0,
                           0, 1, 0, 0,
                                     0, 0, 0, 1,
                                     0, 1, 0, 0,
                                     0, 0, 1, 0,
                                     0, 1, 0, 0,
                                     0, 0, 1, 0,
                                     0, 1, 0, 0,
                                     0, 1, 0, 0
                              ), nrow = 10, ncol=4, byrow=TRUE)

Similarly we can obtain training and validation sets

># Generate training and validation data sets for X, Y
> trainIndex   = c(2, 8, 10, 4, 5);
> validIndex   = c(1, 3, 6, 7, 9);
Now, we can use nnet package fitting function for multinomial log-linear.

> mfit = multinom(formula = Y[trainIndex, ] ~ X[trainIndex] + 0)

Note that we put 0 for the intercepts while first column of Y is dummy.

> predict(mfit, X[validIndex])
[1] 2 2 2 2 2

Resulting classification is not that great. The above example shown a basic approach to supervised learning in MATLAB and R. However, one must check statistical values, like residuals, deviance, p-values etc. for the quality of results.

Monday, 15 October 2012

Compressed Sensing with R

Compressed sensing (CS) is pretty much appealing all current signal processing research community. At the same time, popularity of R language gaining a strong foot in the research and industry. Even though historically MATLAB is a de-facto standard in signal processing community, R is becoming a serious alternative to this. For example, the quality of R in time-series analysis or medical imaging is now an accepted fact.

Last year, I have demonstrated how one can use R in compressed sensing research in a short tutorial in the pub, close to Liverpool street in London. The package R1magic is available in CRAN. Package provides basic interface to perform 1-D compressed sensing with l1, TV penalized minimization. There are other packages doing similar regularized minimization. However the interface of R1magic is particularly designed for CS i.e. appearance of sparse bases in the objective function.

Here is one simple example ( Version 0.1):

library(R1magic)# Signal components
N <- 100
# Sparse components
K <- 4
# Up to Measurements > K LOG (N/K)
M <- 40
# Measurement Matrix (Random Sampling Sampling)
phi <- GaussianMatrix(N,M)
# R1magic generate random signal
xorg <- sparseSignal(N, K, nlev=1e-3)
y <- phi %*% xorg ;# generate measurement
T <- diag(N) ;# Do identity transform
p <- matrix(0, N, 1) ;# initial guess
# R1magic Convex Minimization ! (unoptimized-parameter)
ll <- solveL1(phi, y, T, p)
x1 <- ll$estimate
plot( 1:100, seq(0.011,1.1,0.011), type = "n",xlab="",ylab="")
title(main="Random Sparse Signal Recovery",
xlab="Signal Component",ylab="Spike Value")
lines(1:100, xorg , col = "red")
lines(1:100, x1, col = "blue", cex = 1.5)
# shifted by 5 for clearity

Blue line is the reconstructed signal with R1magic.

Sunday, 27 November 2011

Biologically Inspired Algorithms for Financial Modelling

Biologically inspired algorithms (BIA) has a special place in heuristic modeling in many different fields of computational sciences, for example in materials simulations, AI, operational research among many others. A recent book covers the area of Financial Modeling using BIAs. Chapters about corporate failure prediction using grammatical evolution and ant colony are quite interesting. I think topic of failure prediction is closely related to resilience of a financial institution, see my earlier blog entry.