Wednesday 28 July 2021

Deep Learning in Mind a Gentle Introduction to Spectral Ergodicity

Preamble

    Figure: Monalisa on
Eigenvector grids (Wikipedia)

In the post, A New Matrix Mathematics for Deep Learning : Random Matrix Theory of Deep Learning, we have outlined a new mathematical concepts that are aimed at deep learning but in general belonging to applied mathematics. Here, we dive into one of the concepts,  spectral ergodicity. We aimed at conveying what does it mean and how to compute spectral ergodicity for a set of matrices, i.e., ensemble. We will use a visual aid and verbal descriptions of steps to produce a quantitative measure of spectral ergodicity. 

The idea of spectral ergodicity comes from quantum statistical physics but it is recently revived for deep learning as a new concept in order to accommodate mathematical needs of explaining and understanding the complexity of deep learning architectures.

Understanding Spectral Ergodicity

The concept of ergodicity can get quiet mathematical even for a professional mathematician.  A practical understanding of ergodicity  could lead to the law of large numbers statistically speaking. However, observed ergodicity for ensemble of matrices, i.e. over their eigenvalue spectrum, are not formally defined before in the literature, and only appeared in statistical quantum mechanics in a specialised case.  Here we do a formal definition gently.

The spectral ergodicity of snapshot of values from $M$ matrices, where they are $N \times N$ sizes,  denoted by $\Omega$, can be produce with the following steps:
  1. Compute eigenvalues of $M$ matrices separately.  
  2. Produce equidistance spectra of matrices out of eigenvalues, i.e., histograms with $b_{k}$ bins. Each cell in the Figure corresponds to bin in the spectra of the matrices. 
  3. Compute average values over each bin across  $M$ matrices.
  4. Computing root mean square deviation that went to each bin from $M$ matrices from corresponding ensemble averaged value and average over $M$ and $N$. This will give a distribution, $\Omega=\Omega(b_{k})$, which represents spectral ergodicity value, think as a snapshot value of a dynamical process.
Attentive reader would notice that normally, measures of ergodicity leads to a single value, such as in spin-glasses, but here we obtain ergodicity as a measure distribution. This stems from the fact that our observable is not univariate but it is a multivariate measure over spectra of the matrix, i.e., bins in the histogram of eigenvalues.  

Why spectral ergodicity important for deep learning? 

The reason why this measure is so important lies in dynamics and consistency in measuring observables (no nothing to do with quantum mechanics but time and ensemble averages classically). Normally we can't measure ensemble averages. In experimental conditions the measurement we do is usually a time averaged value. This is exactly what happens when we train deep neural network, i.e, ergodicity of weight matrices. Essentially, spectral ergodicity would capture deep neural network's characteristics.
Outlook

The way we express spectral ergodicity here would only consider all layer having the same size.  One would need a more advanced computation of spectral ergodicity for more realistic architectures, which is called cascading Periodic Spectral Ergodicity measure suitable as a complexity measure for deep learning.  The computation of such measure is more involved and spectral ergodicity we cover here is the first step.

Cite this post with  Deep Learning in Mind Very Gentle Introduction to Spectral Ergodicity, Mehmet Süzen, (2021) https://science-memo.blogspot.com/2021/07/deep-learning-random-matrix-theory-spectral-ergodicity.html 

Wednesday 21 July 2021

A New Matrix Mathematics for Deep Learning : Random Matrix Theory of Deep Learning

 Preamble 

    Figure: Definition of Randomness
 (Compagner 1991, Delft University)
Development of deep learning systems (DLs)  increased our hopes to develop more autonomous systems. Based on the hierarchal learning of representations, deep learning defies the basic learning theory that beg the question of still rethinking generalisation. Even though DLs lacks severely the ability to reason without causal inference, they can't do that in vanilla form. However despite this limitation, they provide very rich new mathematical concepts as introduced recently. Here, we review couple of these new concepts briefly and draw attention to Random Matrix Theory's relevance in DLs and its applications in Brain networks.  These concepts in isolation are subject of applied mathematics but their interpretation and usage in deep learning architectures are demonstrated recently. In this post we provide a glossary of new concepts, that are not only theoretically interesting, they are directly practical from measuring architecture complexity to equivalance.  

Random matrices can simulate deep learning architectures with spectral ergodicity

Random Matrix Theory (RMT) has origins in foundation of mathematical statistics and mathematical physics pioneered by Wishart Distribution and Dyson Circular Ensembles.  As primary ingredient of a deep learning model as a result are set of weights, or learned parameter set, manifests as matrices and they come from a learning dynamics that are used in so called in inference time. Natural consequence of this, learning these matrices can be simulated via Random matrices of spectral radius close to unity. This provides us the following, ability to make a generic statement about deep learning systems independent of 
  1. Network architecture (topology).
  2. Learning algorithm. 
  3. Data sizes and type.
  4. Training procedure.

Why not Hessian or loss-landscape but Weight matrices? 

There are studies taking Hessian matrix as a major object, i.e., second derivative of parameters as a function of loss of the network and associate this to random matrices. However, this approach would only covers learning algorithm properties rather than architectures inference or learning capacity. For this reason, weight matrices should be taken as a primary object in any studies of random matrix theory in deep learning as they encode depth in deep learning. Similarly, loss-landscape can not capture the capacity of deep learning. 

Conclusion and outlook

In this short exposition, we tried to stimulate readers interest in exciting set of tools from RMTs for deep learning theory and practice. That is still subject of recent research with direct practical relevance. We provided glossary and reading list as well.  

Further Reading

Papers introducing new mathematical concepts in deep learning are listed here, they come with associated Python codes for reproducing the concepts.

Earlier relevant blog posts 

Citing this post

A New Matrix Mathematics of Deep Learning: Random Matrix Theory of Deep Learning : https://science-memo.blogspot.com/2021/07/random-matrix-theory-deep-learning.html Mehmet Süzen, 2021

Glossary of New Mathematical Concepts of Deep Learning

Summary of the definition of new mathematical concepts for new matrix mathematics.

Spectral Ergodicity Measure of ergodicity in spectra of a given random matrix ensemble sizes. Given set of matrices of equal size that are coming from the same ensemble, average deviation of spectral densities of individual eigenvalues over ensemble averaged eigenvalue. This mimic standard ergodicity, instead of over states of the observable, it measures ergodicity over eigenvalue densities.  $\Omega_{k}^{N}$, $k$-th eigenvalue and matrix size of $N$.

Spectral Ergodicity Distance A symmetric distance constructed with two Kullback-Leibler distances over two different size matrix ensembles, in two different direction. $D = KL(N_{a}|N_{b})+ KL(N_{b}|N_{a})$

Mixed Random Matrix Ensemble (MME) Set of matrices constructed from a random ensemble but with difference matrix sizes from N to 2, sizes determined randomly with a coefficient of mixture. 

Periodic Spectral Ergodicity (PSE) A measure of Spectral ergodicity for MMEs whereby smaller matrix spectrum placed in periodic boundary conditions, i.e., cyclic list of eigenvalues, simply repeating them up to N eigenvalues. 

Layer Matrices Set of learned weight matrices up to a layer in deep learning architecture. Convolutional layers mapped into a matrix, i.e. stacked up. 

Cascading Periodic Spectral Ergodicity (cPSE) Measuring PSE over feedforward manner in a deep neural network.  Ensemble size is taken up-to that layer matrices. 

Circular Spectral Deviation (CSD) This is a measure of fluctuations in spectral density between two ensembles.

Matrix Ensemble Equivalence If CSDs are vanishing for conjugate MMEs, they are said to be equivalent.

Appendix: Practical Python Example

Complexity measure for deep architectures and random matrix ensembles: cPSE.cpse_measure_vanilla Python package Bristol  (>= v0.2.12) has now a support for computing cPSE from a list of matrices, no need to put things in torch model format by default.


!pip install bristol==0.2.12


An example case:


from bristol import cPSE

import numpy as np

np.random.seed(42)

matrices = [np.random.normal(size=(64,64)) for _ in range(10)]

(d_layers, cpse) = cPSE.cpse_measure_vanilla(matrices) 


d_layers is decreasing vector, it will saturate at some point, that point is where adding more

layers won’t improve the performance. This is data, learning or architecture independent measure.

Only a French word can explain the excitement here: Voilà!





(c) Copyright 2008-2024 Mehmet Suzen (suzen at acm dot org)

Creative Commons License
This work is licensed under a Creative Commons Attribution 4.0 International License.