Monday 15 November 2021

Periodic Spectral Ergodicity Accurately Predicts Deep Learning Generalisation

 Preamble 

    Figure: Dali (1931), The Persistence of Memory (Wikipedia)

Periodic spectral ergodicity (PSE) is one of the new mathematical concepts that arose from efforts to understand deep learning. Its cascading version (cPSE) propagates over the layers of a deep learning architecture and can be used as a complexity measure. cPSE can also predict the generalisation ability. In this post, we review this interesting finding in a short and accessible manner.

How periodic spectral ergodicity cascades over layers

We have reviewed spectral ergodicity in a gentle fashion earlier, here. The only difference is that in real deep learning architectures the eigenvalue spectra generated by the weight matrices, i.e., the number of bins in their histograms, are not equal in length. To align them, we apply periodic boundary conditions, i.e., we repeat the eigenvalues in a cyclic fashion up to the maximum spectrum length seen up to that layer. The following steps give the intuition of how to compute cascading periodic spectral ergodicity (cPSE).

1. Compute the eigenvalue spectra up to layer $i$ and align the shorter spectra with periodic boundary conditions, i.e., cyclically.

2. Compute the spectral ergodicity at layers $i$ and $i-1$.

3. Compute the cascading PSE at layer $i$ as a distance between $\Omega^{i}$ and $\Omega^{i-1}$, i.e., the KL divergence in both directions; recall the earlier tutorials.

If we repeat this up to the last layer, cPSE measures the complexity of the deep learning architecture as a function of depth, capturing both its structure and the effect of the learning algorithm.
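The three steps above can be sketched in a few lines of NumPy. This is a minimal illustration under our own assumptions, not the implementation in the paper or in any package: the helper names (`cyclic_pad`, `omega`, `symmetric_kl`, `cascading_distances`) are hypothetical, singular values stand in for the eigenvalue spectrum so that non-square weight matrices can be handled, and the bin count and the normalisation of $\Omega$ are illustrative choices.

```python
import numpy as np


def cyclic_pad(spectrum, length):
    """Periodic boundary conditions: repeat the spectrum cyclically up to `length`."""
    return np.resize(spectrum, length)


def omega(aligned_spectra, bin_edges):
    """Spectral ergodicity per bin for equal-length spectra (one spectrum per row)."""
    spectra = np.asarray(aligned_spectra)
    m, n = spectra.shape
    densities = np.array(
        [np.histogram(s, bins=bin_edges, density=True)[0] for s in spectra]
    )
    # Squared deviation from the ensemble average, normalised by M and N (assumed).
    return ((densities - densities.mean(axis=0)) ** 2).sum(axis=0) / (m * n)


def symmetric_kl(p, q, eps=1e-12):
    """Sum of the KL divergences in both directions between two non-negative vectors."""
    p = (p + eps) / (p + eps).sum()
    q = (q + eps) / (q + eps).sum()
    # KL(p|q) + KL(q|p) simplifies to the sum of (p - q) * log(p / q).
    return float(np.sum((p - q) * np.log(p / q)))


def cascading_distances(layer_matrices, n_bins=32):
    """Distance between Omega^i and Omega^(i-1) for each layer i, in feedforward order."""
    # Assumption: singular values stand in for the eigenvalue spectrum.
    spectra = [np.linalg.svd(w, compute_uv=False) for w in layer_matrices]
    top = max(s.max() for s in spectra)
    bin_edges = np.linspace(0.0, top, n_bins + 1)
    distances, previous = [], None
    # Start with two layers: a single spectrum has no deviation from its own average.
    for i in range(2, len(spectra) + 1):
        longest = max(len(s) for s in spectra[:i])
        aligned = [cyclic_pad(s, longest) for s in spectra[:i]]
        current = omega(aligned, bin_edges)
        if previous is not None:
            distances.append(symmetric_kl(current, previous))
        previous = current
    return distances


rng = np.random.default_rng(42)
shapes = [(64, 64), (32, 64), (64, 16), (48, 48), (64, 64)]
layers = [rng.normal(size=shape) for shape in shapes]
distances = cascading_distances(layers)
```

Here `distances` holds one value per layer from the third layer onwards. How these values are aggregated into a single cPSE number is deliberately left out of this sketch.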

 Generalisation Gap and cPSE

Apart from being a complexity measure, cPSE predicts the generalisation gap for a given reference architecture, i.e., it correlates almost perfectly with performance. These findings are presented in the paper suzen2019.

Conclusions and Outlook

The complexity of deep learning architectures is still an open research problem. One of the most promising directions is to use cPSE, which captures structural complexity as well. Other measures in the literature do not consider depth dependency; cPSE appears to be the first to do so.

Reference

@article{suzen2019,
  title={Periodic Spectral Ergodicity: A Complexity Measure for Deep Neural Networks and Neural Architecture Search},
  author={S{\"u}zen, Mehmet and Cerd{\`a}, Joan J and Weber, Cornelius},
  journal={arXiv preprint arXiv:1911.07831},
  year={2019}
}

Cite this post as  Periodic Spectral Ergodicity Accurately Predicts Deep Learning Generalisation, Mehmet Süzen,  https://science-memo.blogspot.com/2021/11/periodic-spectral-ergodicity-predicts-generalisation-deep-learning.html 2021

Appendix 

Bristol v0.2.12 now supports computing cPSE from a list of matrices:

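from bristol import cPSE
import numpy as np

# Ten random 64 x 64 matrices standing in for layer weight matrices.
np.random.seed(42)
matrices = [np.random.normal(size=(64,64)) for _ in range(10)]

# d_layers is a vector over layers; cpse is the complexity measure.
(d_layers, cpse) = cPSE.cpse_measure_vanilla(matrices)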



Wednesday 28 July 2021

Deep Learning in Mind a Gentle Introduction to Spectral Ergodicity

Preamble

    Figure: Mona Lisa on eigenvector grids (Wikipedia)

In the post, A New Matrix Mathematics for Deep Learning : Random Matrix Theory of Deep Learning, we outlined new mathematical concepts that are aimed at deep learning but belong to applied mathematics in general. Here, we dive into one of these concepts, spectral ergodicity. We aim to convey what it means and how to compute spectral ergodicity for a set of matrices, i.e., an ensemble. We will use a visual aid and verbal descriptions of the steps to produce a quantitative measure of spectral ergodicity.

The idea of spectral ergodicity comes from quantum statistical physics, but it was recently revived for deep learning as a new concept to accommodate the mathematical needs of explaining and understanding the complexity of deep learning architectures.

Understanding Spectral Ergodicity

The concept of ergodicity can get quite mathematical even for a professional mathematician. Statistically speaking, a practical understanding of ergodicity leads to the law of large numbers. However, ergodicity observed for an ensemble of matrices, i.e., over their eigenvalue spectra, had not been formally defined in the literature and had only appeared in statistical quantum mechanics in a specialised case. Here we give a formal definition gently.

The spectral ergodicity of a snapshot of values from $M$ matrices, each of size $N \times N$, denoted by $\Omega$, can be produced with the following steps:
  1. Compute the eigenvalues of the $M$ matrices separately.
  2. Produce equidistant spectra of the matrices from the eigenvalues, i.e., histograms with bins $b_{k}$. Each cell in the Figure corresponds to a bin in the spectra of the matrices.
  3. Compute the average value of each bin across the $M$ matrices.
  4. Compute the root mean square deviation of each bin of the $M$ matrices from the corresponding ensemble-averaged value, and average over $M$ and $N$. This gives a distribution, $\Omega=\Omega(b_{k})$, which represents the spectral ergodicity value; think of it as a snapshot value of a dynamical process.

An attentive reader would notice that measures of ergodicity normally lead to a single value, such as in spin glasses, but here we obtain ergodicity as a distribution. This stems from the fact that our observable is not univariate but a multivariate measure over the spectra of the matrices, i.e., the bins in the histogram of eigenvalues.
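The four steps above translate almost directly into NumPy. The sketch below is an illustration under stated assumptions rather than a reference implementation: the function names are hypothetical, the matrices are symmetrised so that the eigenvalues are real, and the normalisation in step 4 (squared deviations summed over the ensemble and divided by $M N$) is our reading of "average over $M$ and $N$". The square root is omitted here; apply `np.sqrt` to the result if a root-mean-square value is preferred.

```python
import numpy as np


def spectral_ergodicity(matrices, n_bins=50):
    """Return Omega(b_k), one value per histogram bin, for M symmetric N x N matrices."""
    m = len(matrices)
    n = matrices[0].shape[0]
    # Step 1: eigenvalues of each matrix (real, since the matrices are symmetric).
    eigenvalues = np.array([np.linalg.eigvalsh(a) for a in matrices])
    # Step 2: equidistant bins shared by all matrices, one density histogram each.
    bin_edges = np.linspace(eigenvalues.min(), eigenvalues.max(), n_bins + 1)
    densities = np.array(
        [np.histogram(ev, bins=bin_edges, density=True)[0] for ev in eigenvalues]
    )
    # Step 3: ensemble average of each bin across the M matrices.
    ensemble_average = densities.mean(axis=0)
    # Step 4: squared deviation from the ensemble average, normalised by M and N.
    return ((densities - ensemble_average) ** 2).sum(axis=0) / (m * n)


def random_symmetric(n, rng):
    """Hypothetical test ensemble: a symmetrised Gaussian matrix."""
    a = rng.normal(size=(n, n))
    return (a + a.T) / np.sqrt(2 * n)


rng = np.random.default_rng(42)
ensemble = [random_symmetric(64, rng) for _ in range(20)]
omega = spectral_ergodicity(ensemble)
```

The result `omega` is a vector with one entry per bin, which matches the point made above that spectral ergodicity is a distribution rather than a single number.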

Why is spectral ergodicity important for deep learning?

The reason why this measure is so important lies in the dynamics and the consistency of measuring observables (this has nothing to do with quantum mechanics; it concerns time and ensemble averages in the classical sense). Normally we can't measure ensemble averages. Under experimental conditions, the measurement we make is usually a time-averaged value. This is exactly what happens when we train a deep neural network, i.e., the ergodicity of the weight matrices. Essentially, spectral ergodicity captures the characteristics of a deep neural network.

Outlook

The way we express spectral ergodicity here only considers layers that all have the same size. More realistic architectures need a more advanced computation of spectral ergodicity, called the cascading Periodic Spectral Ergodicity measure, which is suitable as a complexity measure for deep learning. The computation of that measure is more involved, and the spectral ergodicity we cover here is its first step.

Cite this post with  Deep Learning in Mind Very Gentle Introduction to Spectral Ergodicity, Mehmet Süzen, (2021) https://science-memo.blogspot.com/2021/07/deep-learning-random-matrix-theory-spectral-ergodicity.html 

Wednesday 21 July 2021

A New Matrix Mathematics for Deep Learning : Random Matrix Theory of Deep Learning

 Preamble 

    Figure: Definition of Randomness (Compagner 1991, Delft University)

The development of deep learning systems (DLs) increased our hopes of developing more autonomous systems. Based on the hierarchical learning of representations, deep learning defies basic learning theory, which raises the question of rethinking generalisation. DLs severely lack the ability to reason, as they cannot perform causal inference in their vanilla form. Despite this limitation, however, they provide very rich new mathematical concepts, as introduced recently. Here, we briefly review a couple of these new concepts and draw attention to the relevance of Random Matrix Theory to DLs and its applications in brain networks. In isolation these concepts are subjects of applied mathematics, but their interpretation and usage in deep learning architectures have been demonstrated recently. In this post we provide a glossary of new concepts that are not only theoretically interesting but also directly practical, from measuring architecture complexity to equivalence.

Random matrices can simulate deep learning architectures with spectral ergodicity

Random Matrix Theory (RMT) has its origins in the foundations of mathematical statistics and mathematical physics, pioneered by the Wishart distribution and Dyson's circular ensembles. The primary ingredient of a deep learning model is its set of weights, or learned parameter set, which manifests as matrices; they result from a learning dynamics and are used at so-called inference time. As a natural consequence, learning these matrices can be simulated via random matrices with spectral radius close to unity. This gives us the ability to make generic statements about deep learning systems independent of
  1. Network architecture (topology).
  2. Learning algorithm. 
  3. Data sizes and type.
  4. Training procedure.
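
As a small illustration of the random matrices mentioned above, the sketch below draws Gaussian matrices and rescales them by $1/\sqrt{N}$, which by the circular law places the eigenvalues approximately within the unit disc. This scaling is one common construction and our assumption here; the text does not specify which ensemble is used, and the function names are hypothetical.

```python
import numpy as np


def unit_radius_random_matrix(n, rng):
    """Gaussian matrix with entries of variance 1/n, so the spectral radius is near 1."""
    return rng.normal(size=(n, n)) / np.sqrt(n)


def spectral_radius(matrix):
    """Largest absolute value among the (generally complex) eigenvalues."""
    return float(np.max(np.abs(np.linalg.eigvals(matrix))))


rng = np.random.default_rng(0)
radii = [spectral_radius(unit_radius_random_matrix(n, rng)) for n in (64, 128, 256)]
```

For finite sizes the radius fluctuates around one, and the fluctuation shrinks as the matrix size grows.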

Why not Hessian or loss-landscape but Weight matrices? 

There are studies that take the Hessian matrix as the major object, i.e., the second derivative of the loss of the network with respect to its parameters, and associate it with random matrices. However, this approach only covers properties of the learning algorithm rather than the architecture's inference or learning capacity. For this reason, weight matrices should be taken as the primary object in any study of random matrix theory in deep learning, as they encode depth in deep learning. Similarly, the loss landscape cannot capture the capacity of deep learning.

Conclusion and outlook

In this short exposition, we tried to stimulate readers' interest in an exciting set of tools from RMT for deep learning theory and practice. This is still a subject of recent research with direct practical relevance. We also provide a glossary and a reading list.

Further Reading

Papers introducing new mathematical concepts in deep learning are listed here; they come with associated Python code for reproducing the concepts.

Earlier relevant blog posts 

Citing this post

A New Matrix Mathematics of Deep Learning: Random Matrix Theory of Deep Learning : https://science-memo.blogspot.com/2021/07/random-matrix-theory-deep-learning.html Mehmet Süzen, 2021

Glossary of New Mathematical Concepts of Deep Learning

Summary of the definitions of the new mathematical concepts for the new matrix mathematics.

Spectral Ergodicity: A measure of ergodicity in the spectra of a given random matrix ensemble. Given a set of matrices of equal size coming from the same ensemble, it is the average deviation of the spectral densities of individual eigenvalues from the ensemble-averaged eigenvalue density. This mimics standard ergodicity: instead of measuring over the states of an observable, it measures ergodicity over eigenvalue densities. It is denoted $\Omega_{k}^{N}$ for the $k$-th eigenvalue and matrix size $N$.

Spectral Ergodicity Distance: A symmetric distance constructed from two Kullback-Leibler divergences over two matrix ensembles of different sizes, taken in the two different directions: $D = KL(N_{a}|N_{b})+ KL(N_{b}|N_{a})$

Mixed Random Matrix Ensemble (MME): A set of matrices constructed from a random ensemble but with different matrix sizes from N down to 2, with the sizes determined randomly by a coefficient of mixture.

Periodic Spectral Ergodicity (PSE): A measure of spectral ergodicity for MMEs whereby the spectrum of a smaller matrix is placed in periodic boundary conditions, i.e., as a cyclic list of eigenvalues, simply repeating them up to N eigenvalues.

Layer Matrices: The set of learned weight matrices up to a layer in a deep learning architecture. Convolutional layers are mapped into a matrix, i.e., stacked up.

Cascading Periodic Spectral Ergodicity (cPSE): Measuring PSE in a feedforward manner through a deep neural network. The ensemble at each layer consists of the layer matrices up to that layer.

Circular Spectral Deviation (CSD): A measure of fluctuations in spectral density between two ensembles.

Matrix Ensemble Equivalence: If the CSDs vanish for conjugate MMEs, the ensembles are said to be equivalent.

Appendix: Practical Python Example

Complexity measure for deep architectures and random matrix ensembles: cPSE.cpse_measure_vanilla. The Python package Bristol (>= v0.2.12) now supports computing cPSE from a list of matrices; there is no need to put things in torch model format by default.


!pip install bristol==0.2.12


An example case:


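from bristol import cPSE
import numpy as np

# Ten random 64 x 64 matrices standing in for layer weight matrices.
np.random.seed(42)
matrices = [np.random.normal(size=(64,64)) for _ in range(10)]

# d_layers is a vector over layers; cpse is the complexity measure.
(d_layers, cpse) = cPSE.cpse_measure_vanilla(matrices)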


d_layers is a decreasing vector that saturates at some point; that point is where adding more layers won’t improve the performance. This is a data-, learning- and architecture-independent measure.

Only a French word can explain the excitement here: Voilà!
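
To locate the saturation point of such a decreasing vector programmatically, one option is to look for the first entry where the change from the previous entry falls below a tolerance. The helper below is a hypothetical sketch and not part of Bristol; the relative tolerance is an arbitrary assumption, and the toy vector is made up and merely stands in for a d_layers output.

```python
import numpy as np


def saturation_index(values, rel_tol=0.01):
    """Index of the first entry whose change from the previous entry is below tolerance."""
    values = np.asarray(values, dtype=float)
    steps = np.abs(np.diff(values))
    # Tolerance is relative to the first entry (assumed convention).
    scale = abs(values[0]) if values[0] != 0 else 1.0
    below = np.nonzero(steps < rel_tol * scale)[0]
    return int(below[0]) + 1 if below.size else None


toy_d_layers = [1.0, 0.5, 0.25, 0.2, 0.199, 0.1989]  # made-up values
saturated_at = saturation_index(toy_d_layers)
```

The function returns `None` when no step falls below the tolerance, i.e., when the vector has not saturated within the layers given.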





Friday 23 April 2021

On the fallacy of replacing physical laws with machine-learned inference systems

Preamble

Progress in machine learning, specifically so-called deep learning, has been astonishingly successful over the last decade in many areas, from computer vision to natural language translation, reaching automation close to human-level performance in narrow areas, so-called narrow artificial intelligence. At the same time, the scientific and academic communities also joined in applying deep learning in physics and in the physical sciences in general. If this is used as an aid to known techniques, it is really good progress, for example in drug discovery, accelerating molecular simulations, or astrophysical discoveries to understand the universe. Unfortunately, however, it is now an almost standard claim that one could supposedly replace physical laws with deep learning models. We criticise these claims in general, without naming any of our colleagues or works.

Circular reasoning: Usage of data produced by known physics 

Figure: Blind monks examining an elephant (Wikipedia)

The primary fallacy in papers claiming to produce a learning system that can actually produce physical laws, or replace physics with a deep learning system, lies in how these systems are trained. Regardless of how good they are at prediction, their primary ability is the product of already known laws. They only replicate the laws embedded in datasets that were generated by physical laws.

Faulty generalisation: Computational acceleration in narrow application to replacing laws

One of the major faults in concluding that a machine-learned inference system does better than a physical law is the faulty generalisation of computational acceleration in narrow application areas. This computational acceleration cannot be generalised to the whole parameter space, since these systems are usually trained on a restricted part of the parameter space using data generated by physical laws, for example solving N-body problems, dynamics at any scale from an action or Lagrangian, or generating fundamental particle physics Lagrangians.

Benefits: Causality still requires a scientist

The intention of this short article is to show the limitations of using machine-learned inference systems in discovering scientific laws. There are of course benefits of leveraging machine learning and data science techniques in the physical sciences, especially in accelerating simulations in narrow specialised areas, automating tasks, and assisting scientists in cumbersome validations, such as searching and translating between two domains, especially in medicine and astrophysics, for example sorting images of galaxy formations. However, the results would still need a skilled physicist or scientist to really understand them and to form a judgment for a scientific law or discovery, i.e., to establish causality.

Conclusion : No automated physicist or automated scientific discovery

Artificial general intelligence has not been achieved yet. It is for the benefit of the physical sciences that researchers do not claim to have found a deep learning system that can replace physical laws in supervised or semi-supervised settings, and rather concentrate on applications that benefit both theoretical and applied advancement in a down-to-earth fashion. Similarly, funding agencies should be more reasonable and avoid funding such claims.

In summary, if datasets are produced by known physical laws or mathematical principles, a new deep learning system only replicates what was already known and does not constitute new knowledge, regardless of how well it predicts or behaves on new inputs. Caution is advised. We cannot yet replace physicists with machine-learned inference systems; in fact, not even radiologists have been replaced, despite the impressive advancement in computer vision that produces super-human results.


 @misc{suezen21fallacy, 
     title = {On the fallacy of replacing physical laws with machine-learned inference systems}, 
     howpublished = {\url{http://science-memo.blogspot.com/2021/04/on-fallacy-of-replacing-physical-laws.html}}, 
     author = {Mehmet Süzen},
     year = {2021}
}  



Postscripts

The following interpretations and reformulations were curated after the initial post.


Postscript 1: Regarding Symbolic regression

There are now multiple claims that one could replace physics with symbolic regression. Yes, symbolic regression is quite a powerful method. However, using raw data produced by physical laws, so-called simulation data from classical mechanics, or modelling experimental data guided by functional forms provided by physics, does not imply that one could replace physics or physical laws with a machine-learned system. We have not achieved Artificial General Intelligence (AGI) and symbolic regression is not AGI. Symbolic regression may not even be useful beyond being a verification tool for theory and numerical solutions of physical laws.

Postscript 2: Fallacy on the dimensionality reduction and distillation of physical laws with machine learning

There are now multiple claims that one could distill physical dynamical laws with dimensionality reduction. This is indeed a novel approach. However, the core dataset is generated by the coupled set of dynamical equations that are supposed to be reduced, with a fixed set of initial conditions. This does not imply any kind of distillation of the set of original laws, i.e., the procedure cannot be qualified as distilling a set of equations into a smaller number of equations or variates. It only provides an accelerated deployment of dynamical solvers under very specific conditions. This includes any renormalisation group dynamics.

Postscript 3: New terms, Scientific Machine Learning Fallacy and s-PINNs

Usage of symbolic regression with deep learning should be called symbolic physics-informed neural networks (s-PINNs). Calling these approaches “machine scientist”, “automated scientist” or “physics laws generator” is technically a fallacy, i.e., the Scientific Machine Learning Fallacy, primarily caught up in circular reasoning.

Postscript 4: AutoML is a misnomer : Scientific Machine Learning (SciML) Fallacy 

SciML is immensely promising in providing accelerated deployment of known scientific workflows in specialised areas such as trajectory learning, novel operator solvers, astrophysical image processing, molecular dynamics, and computational applied mathematics in general. Unfortunately, some recent papers continue to jump to claims of automated scientific discovery and of replacing known physical laws with supervised learning systems, including new NLP systems.


The primary fallacy in papers claiming to produce a learning system that can actually produce physical/scientific laws, or replace physics/science with a deep learning system, lies in how these systems are trained. AutoML in this context doesn’t actually replace the scientist but abstracts former workflows into a different, meta-scientific kind of work that assists scientists; hence it is a misnomer, and MetaML is probably a more suitable term.



Thursday 1 April 2021

Shifting Modern Data Science Forward: Dijkstra principle for data science


Prelude
Dijkstra in Zurich, 1984 (Wikipedia)

Edsger Dijkstra was a Dutch theoretical physicist turned computer scientist, and probably one of the most influential early pioneers in the field. He had deep insight into what computer science is and a well-founded notion of how it should be taught in academia. In this post we extrapolate his ideas to data science. We develop something called the Dijkstra principle for data science, which is driven by his ideas on what computer science entails.

Computer Science and Astronomy 

Astronomy is not about telescopes. Indeed, it is about how the universe works and how its constituent parts interact. Telescopes, whether optical or radio, or similar detection techniques, are merely tools for practising astronomy and conducting investigations. A similar analogy applies to computer science; this is the quote from Dijkstra:
Computer science is no more about computers than astronomy is about telescopes.  - Edsger Dijkstra
The idea of computer science not being about computers is rather strange at first. However, what Dijkstra had in mind are the abstract mechanisms and mathematical constructs onto which one can map real problems and solve them as computer science problems, such as graph algorithms. Though computer science has a lot of subfields, its inception can be considered as rooted in applied mathematics.

Dijkstra principle for data science

By using Dijkstra's approach, we are now in a position to formulate a principle for data science.
Data science is no more about data than computer science is about computers. -Dijkstra principle for data science
This sounds absurd. If data science is not about data, then what is it about? Apart from the definition of data science as an emergent field, an amalgamation of multiple fields from statistics to high performance computing, the idea that data is not the core tenet of data science implies that the practice does not aim at data itself but rather at a higher purpose. Data is used like a telescope in astronomy: the purpose is to reveal the empirical truths about the representations the data conveys. There is no unique way to achieve this purpose.

Conclusive Remarks

The Dijkstra principle for data science would be very helpful in understanding data science practice not as data-centric, contrary to mainstream dogma, but rather as a science-centric practice with data being the primary tool to leverage, using a multitude of techniques. The implication is that machine learning is a secondary tool on top of data in practising data science. This attitude would help causality play a major role in shifting modern data science forward.


Saturday 20 March 2021

Computable function analogs of natural learning and intelligence may not exist


Optimal learning : Meta-optimization

Many papers directly equate the “machine” learning problem, i.e., algorithmic learning as opposed to human or animal learning, with an optimisation problem. Unfortunately, contrary to common belief, machine learning is not an optimisation problem. For example, take an optimal learning strategy: if we replace learning with optimisation, we end up with the absurd term optimal optimisation strategy.

Turing machine (Wikipedia)
It sounds like machine learning as practised is a meta-optimisation problem rather than learning as humans do.

Computable functions to learning

Fundamentally, we do not know how human learning can be mapped into an algorithm, whether there are computable function analogs of human learning, or whether human intelligence and its artificial analog can be represented in a Turing-computable manner.

Sunday 7 March 2021

Critical look on why deployed machine learning model performance degrade quickly

Illustration of William of Ockham 
(Wikipedia)
One of the major problems in using a so-called machine learning model, usually a supervised model, arises in so-called deployment, meaning it serves new data points that were not in the training or test set. With great astonishment, modellers or data scientists observe that the model's performance degrades quickly, or that it doesn't perform as well as it did on the test set. We earlier ruled out underspecification as the main cause. Here we propose that the primary reason for such performance degradation lies in relying solely on the hold-out method to judge generalised performance.

Why model test performance does not reflect in deployment? Understanding overfitting

A major contributing factor is the inaccurate meme of overfitting, which is used to actually mean overtraining, and the erroneous connection of overtraining solely to generalisation. This was discussed earlier here as understanding overfitting. Overfitting is not about how good the function approximation of the same “model” is on one subset of the dataset compared to other subsets. Hence, the hold-out method (test/train) of measuring performance does not provide sufficient and necessary conditions to judge a model’s generalisation ability: with this approach we can detect neither overfitting (in the Occam’s razor sense) nor the deployment performance.

How to mimic deployment performance?

This depends on the use case, but the most promising approaches lie in adaptive analysis and in detecting distribution shifts and building models accordingly. However, the answer to this question is still open research.
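
As one concrete way of detecting a distribution shift between training data and data seen in deployment, the sketch below computes a two-sample Kolmogorov-Smirnov statistic for a single feature. This is an illustrative choice on our part, not a method prescribed above; the function name, the synthetic data and the alert threshold are all assumptions.

```python
import numpy as np


def ks_statistic(sample_a, sample_b):
    """Two-sample Kolmogorov-Smirnov statistic: largest gap between empirical CDFs."""
    a = np.sort(np.asarray(sample_a, dtype=float))
    b = np.sort(np.asarray(sample_b, dtype=float))
    # Evaluate both empirical CDFs at every observed point.
    grid = np.concatenate([a, b])
    cdf_a = np.searchsorted(a, grid, side="right") / a.size
    cdf_b = np.searchsorted(b, grid, side="right") / b.size
    return float(np.max(np.abs(cdf_a - cdf_b)))


rng = np.random.default_rng(7)
train_feature = rng.normal(loc=0.0, size=2000)
deploy_same = rng.normal(loc=0.0, size=2000)      # no shift
deploy_shifted = rng.normal(loc=1.0, size=2000)   # mean shifted by one unit

stat_same = ks_statistic(train_feature, deploy_same)
stat_shifted = ks_statistic(train_feature, deploy_shifted)
shift_detected = stat_shifted > 0.1  # hypothetical alert threshold
```

A statistic near zero suggests the deployed data resembles the training data for that feature, while a large value flags a shift that may call for rebuilding or adapting the model.
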
(c) Copyright 2008-2024 Mehmet Suzen (suzen at acm dot org)

Creative Commons License
This work is licensed under a Creative Commons Attribution 4.0 International License.