Showing posts with label probabilistic models. Show all posts

Tuesday, 15 November 2022

Differentiating ensembles and sample spaces: Alignment between statistical mechanics and probability theory

Preamble 

The sample space is the primary concept introduced in any probability and statistics book or paper. However, there is a lack of clarity about what constitutes a sample space in general: there is no explicit distinction between the set of unique events and the sets of replicas. The resolution of this ambiguity lies in the concept of an ensemble. The concept was first introduced by the American theoretical physicist and engineer Gibbs in his book Elementary Principles in Statistical Mechanics. The primary utility of an ensemble is as a mathematical construction that differentiates between samples and how they form extended objects.

In this direction, we provide the basics of constructing ensembles from sample spaces in a pedagogically accessible way, clearing up a possible misconception. This usage of ensemble prevents the overuse of the term sample space for different things. We introduce some basic formal definitions.

    Figure: Gibbs's book introduced the concept of ensemble (Wikipedia).

What did Gibbs have in mind in constructing statistical ensembles?

A statistical ensemble is a mathematical tool that connects statistical mechanics to thermodynamics. The concept lies in defining microscopic states for molecular dynamics; in statistics and probability, this corresponds to a set of events. Though these events are different at the microscopic level, they are sampled from a single thermodynamic ensemble, a representative of varying material properties or, in general, a set of independent random variables. In dynamics, micro-states sample an ensemble. This simple idea helped Gibbs build a mathematical formalism of statistical mechanics companion to Boltzmann's theories.

Differentiating sample space and ensemble in general

The primary confusion in probability theory about what constitutes a sample space is that no distinction is made between primitive events and events composed of primitive events. We call both sets a sample space. This terminology is easily overlooked in general, as we concentrate on the set of composed events, not the set of primitive events, in solving practical problems.

Definition: A primitive event $\mathscr{e}$ is a logically distinct unit of experimental realisation that is not composed of any other events.

Definition: A sample space $\mathscr{S}$ is a set formed by all $N$ distinct primitive events $\mathscr{e}_{i}$.  

By this definition, regardless of how many fair coins are used, or whether a single coin is tossed in a sequence, the sample space is always $\{H, T\}$, because these are the most primitive distinct events the system can have, i.e., the outcomes of a single coin. However, the statistical ensemble can be different. For example, for two fair coins, or a coin tossed in a sequence of length two, the corresponding ensemble of system size two reads $\{HH, TT, HT, TH\}$. The definition of ensemble then follows.

Definition: An ensemble $\mathscr{E}$ is a set of ordered sets of primitive events $\mathscr{e}_{i}$. The primitive events can be sampled with replacement, and order matters, i.e., $(e_{i}, e_{j}) \ne (e_{j}, e_{i})$ for $i \ne j$.

Our two-coin example's ensemble should be formally written as $\mathscr{E}=\{(H,H), (T,T), (H,T), (T,H)\}$; as order matters, the members $HT$ and $TH$ are distinct. Obviously, for a single toss the ensemble and the sample space are the same.
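The two-coin construction can be sketched in a few lines of Python (a minimal illustration; the variable names simply mirror the definitions above):

```python
from itertools import product

# Sample space: the most primitive distinct events of a single coin
sample_space = {"H", "T"}

# Ensemble of system size two: ordered pairs sampled with replacement,
# so ("H", "T") and ("T", "H") are distinct members
ensemble = set(product(sample_space, repeat=2))
```

The sample space keeps its two members while the ensemble has $|\mathscr{S}|^{2} = 4$ members, which is exactly the distinction drawn above.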

Ergodicity makes the need for differentiation much clearer: Time and ensemble averaging

The above distinction makes building time and ensemble averages much easier. The term ensemble averaging is obvious: we know the ensemble set, and we average a given observable over this set. Time averaging can then be achieved by curating a much larger set by resampling with replacement from the ensemble. Note that the resulting time-average value is not unique, as one can generate many different sample sets from the ensemble. However, bear in mind that the definition of how to measure convergence to the ergodic regime is not unique either.
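As a minimal sketch of the two averages (the observable, counting heads in an ensemble member, is a hypothetical choice for illustration):

```python
import random

random.seed(7)

# Ensemble for two fair coins, as above
ensemble = [("H", "H"), ("T", "T"), ("H", "T"), ("T", "H")]

def observable(member):
    # A hypothetical observable: the number of heads in a member
    return sum(1 for e in member if e == "H")

# Ensemble average: average the observable over the ensemble set itself
ensemble_avg = sum(observable(m) for m in ensemble) / len(ensemble)

# Time average: resample with replacement to curate a much larger set
# (a trajectory); different resamplings give slightly different values
trajectory = [random.choice(ensemble) for _ in range(10_000)]
time_avg = sum(observable(m) for m in trajectory) / len(trajectory)
```

Here the ensemble average is exactly 1.0 heads, while the time average fluctuates around it, depending on the resampled trajectory.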

Conclusion

Even though the distinction we made may sound obscure, this alignment between statistical mechanics and probability theory may clarify the conception of ergodic regimes for general practitioners.

Further reading

Please Cite:

 @misc{suezen22dess, 
     title = {Differentiating ensembles and sample spaces: Alignment between statistical mechanics and probability theory}, 
     howpublished = {\url{https://science-memo.blogspot.com/2022/11/ensembles-probability-theory.html}}, 
     author = {Mehmet Süzen},
     year = {2022}
}  

Postscript

  • If multiple events come from a set of primitive events, the compositional outcomes are considered to be an ensemble, not a sample space. A sample space is a set that we sample from, either once or multiple times, to build an ensemble. The notion of an ensemble within a pure ML context was also noticed by the late David J. C. MacKay in his book Information Theory, Inference, and Learning Algorithms, Cambridge University Press (2003).


Tuesday, 5 July 2022

Bayesian rabbit holes: Decoding conditional probability with non-commutative algebra

Preamble

    The White Rabbit (Wikipedia)

A novice analyst, or even an experienced (data) scientist, might think that the bar notation $|$ used in representing conditional probability carries some different operational mathematics, particularly when written with explicit distribution functions, $p(x|y)$. A similar thought applies to joint probabilities such as $p(x, y)$, and one may see mixtures of these, such as $p(x, y | z)$. In this short exposition, we clarify that none of these identifications within the arguments of a probability has any different resulting operational meaning.

Arguments in probabilities: Boolean statement and filtering 

Arguments in any probability are mathematical statements of discrete mathematics that correspond to events in the experimental setting. These are statements declaring some facts with a boolean outcome, and they are queries against a data set: for example, whether the temperature is above $30$ degrees, $T > 30$. The temperature $T$ is a random variable. Unfortunately, the term random variable is often used differently in many textbooks, where it is defined as a mapping rather than as a single variable. The bar $|$ in the conditional probability $p(x|y)$ means statement $x$ given that statement $y$ has already occurred, i.e., an if. This interpretation implies that $y$ occurred before $x$, but it does not imply that they are causally linked. The condition plays the role of filtering, a where clause in query languages. $p(x|y)$ boils down to $p_{y}(x)$, where the statement $y$ is first applied to the dataset before computing the probability of the remaining statement $x$.

In the case of joint probabilities $p(x, y)$, the events co-occur, i.e., an AND statement. In summary, anything in the argument of $p$ is written as a mathematical statement. When assigning a distribution or a functional form to $p$, there is no particular role for conditionals or joints; the modelling approach sets an appropriate structure.
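The filtering reading of conditionals and the AND reading of joints can be made concrete on a toy data set (the weather records below are hypothetical, chosen only to mirror the $T > 30$ example):

```python
# Hypothetical records of (temperature, rained) observations
records = [(35, True), (32, False), (28, True), (25, False), (31, True)]

def x(r):  # statement x: it rained
    return r[1]

def y(r):  # statement y: temperature above 30
    return r[0] > 30

def prob(stmt, data):
    # Probability of a boolean statement over a data set
    return sum(stmt(r) for r in data) / len(data)

# p(x|y) = p_y(x): filter on y first (a where clause), then compute p(x)
p_x_given_y = prob(x, [r for r in records if y(r)])

# p(x, y): the statements co-occur, i.e., an AND statement
p_x_and_y = prob(lambda r: x(r) and y(r), records)
```

On this data set, $p(x|y) = 2/3$ and $p(x, y) = 2/5$, and the usual identity $p(x|y) = p(x, y)/p(y)$ falls out of the filtering interpretation for free.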

Conditioning does not imply causal direction: do-calculus does

The filtering interpretation of the conditional $p(x|y)$ does not imply a causal direction, but the $do$ operator does: $p(x|do(y))$.

Non-commutative algebra: When frequentists are equivalent to Bayesians

Most simple filtering operations yield identical results when reversed, $p(x|y) = p(y|x)$, the prior being equal to the posterior. This remark implies that we cannot apply Bayesian learning with commutative statements. We need non-commutative statements; as a result, one can do Bayesian learning with newly arriving data, i.e., the arrival of new subjective evidence. The reason appears to be the frequentist nature of filtering.

Outlook 

Even though we have provided some revelations in decoding the operational meaning of conditional probabilities, we suggested that any conditional, joint, or combination of these within the argument of a probability has no operational purpose other than as a pre-processing step. However, the philosophical and practical implications of probabilistic reasoning are always counterintuitive, and probabilistic reasoning is a computationally hard problem. From a causal inference perspective, we are better equipped to tackle these issues with do-Bayesian analysis.

Further reading

Please Cite as:

 @misc{suezen22brh, 
     title = {Bayesian rabbit holes: Decoding conditional probability with non-commutative algebra}, 
     howpublished = {\url{https://science-memo.blogspot.com/2022/07/bayesian-conditional-noncommutative.html}}, 
     author = {Mehmet Süzen},
     year = {2022}
}  

Wednesday, 21 July 2021

A New Matrix Mathematics for Deep Learning: Random Matrix Theory of Deep Learning

 Preamble 

    Figure: Definition of Randomness (Compagner 1991, Delft University)
The development of deep learning systems (DLs) has increased our hopes of building more autonomous systems. Based on hierarchical learning of representations, deep learning defies basic learning theory, which begs the question of rethinking generalisation. DLs severely lack the ability to reason; without causal inference, they cannot do so in vanilla form. Despite this limitation, however, they give rise to very rich new mathematical concepts, as introduced recently. Here, we briefly review a couple of these concepts and draw attention to the relevance of Random Matrix Theory in DLs and its applications in brain networks. These concepts in isolation are the subject of applied mathematics, but their interpretation and usage in deep learning architectures have been demonstrated recently. In this post we provide a glossary of the new concepts, which are not only theoretically interesting but also directly practical, from measuring architecture complexity to equivalence.

Random matrices can simulate deep learning architectures with spectral ergodicity

Random Matrix Theory (RMT) has its origins in the foundations of mathematical statistics and mathematical physics, pioneered by the Wishart distribution and Dyson's circular ensembles. The primary ingredients of a deep learning model are its sets of weights, or learned parameters, which manifest as matrices; they come from a learning dynamics and are then used at so-called inference time. A natural consequence is that these learned matrices can be simulated via random matrices with spectral radius close to unity. This gives us the ability to make generic statements about deep learning systems independent of
  1. Network architecture (topology).
  2. Learning algorithm. 
  3. Data sizes and type.
  4. Training procedure.
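The simulation ingredient above can be sketched in a few lines (a minimal illustration with NumPy; the matrix size is an arbitrary choice, and this is not the Bristol implementation):

```python
import numpy as np

rng = np.random.default_rng(0)

def random_weight_matrix(n):
    """Gaussian random matrix rescaled so its spectral radius is ~1."""
    w = rng.normal(size=(n, n))
    radius = np.abs(np.linalg.eigvals(w)).max()
    return w / radius

w = random_weight_matrix(64)
spectral_radius = np.abs(np.linalg.eigvals(w)).max()
```

Such rescaled random matrices can then stand in for learned weight matrices, regardless of architecture, algorithm, data, or training procedure.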

Why not the Hessian or the loss landscape, but weight matrices? 

There are studies that take the Hessian matrix as the major object, i.e., the second derivative of the loss with respect to the network's parameters, and associate it with random matrices. However, this approach only covers properties of the learning algorithm rather than the architecture's inference or learning capacity. For this reason, weight matrices should be taken as the primary object in any study of random matrix theory in deep learning, as they encode depth. Similarly, the loss landscape cannot capture the capacity of deep learning. 

Conclusion and outlook

In this short exposition, we have tried to stimulate the reader's interest in an exciting set of tools from RMT for deep learning theory and practice. This is still the subject of ongoing research with direct practical relevance. We have provided a glossary and a reading list as well.

Further Reading

Papers introducing new mathematical concepts in deep learning are listed here; they come with associated Python code for reproducing the concepts.

Earlier relevant blog posts 

Citing this post

A New Matrix Mathematics of Deep Learning: Random Matrix Theory of Deep Learning : https://science-memo.blogspot.com/2021/07/random-matrix-theory-deep-learning.html Mehmet Süzen, 2021

Glossary of New Mathematical Concepts of Deep Learning

Summary of the definition of new mathematical concepts for new matrix mathematics.

Spectral Ergodicity A measure of ergodicity in the spectra of a given random matrix ensemble. Given a set of matrices of equal size coming from the same ensemble, it is the average deviation of the spectral densities of individual eigenvalues from the ensemble-averaged eigenvalue density. This mimics standard ergodicity: instead of being taken over states of an observable, it measures ergodicity over eigenvalue densities. $\Omega_{k}^{N}$, for the $k$-th eigenvalue and matrix size $N$.

Spectral Ergodicity Distance A symmetric distance constructed from two Kullback-Leibler distances over two different-size matrix ensembles, in two different directions: $D = KL(N_{a}|N_{b}) + KL(N_{b}|N_{a})$.

Mixed Random Matrix Ensemble (MME) A set of matrices constructed from a random ensemble but with different matrix sizes, from $N$ down to 2, with the sizes determined randomly by a coefficient of mixture.

Periodic Spectral Ergodicity (PSE) A measure of spectral ergodicity for MMEs whereby the smaller matrix spectra are placed under periodic boundary conditions, i.e., a cyclic list of eigenvalues, simply repeating them up to $N$ eigenvalues.

Layer Matrices The set of learned weight matrices up to a given layer in a deep learning architecture. Convolutional layers are mapped into a matrix, i.e., stacked up.

Cascading Periodic Spectral Ergodicity (cPSE) Measuring PSE in a feedforward manner in a deep neural network. The ensemble is taken as the layer matrices up to that layer.

Circular Spectral Deviation (CSD) This is a measure of fluctuations in spectral density between two ensembles.

Matrix Ensemble Equivalence If CSDs are vanishing for conjugate MMEs, they are said to be equivalent.
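The periodic boundary trick behind PSE, repeating a smaller spectrum cyclically up to $N$ eigenvalues, can be sketched as follows (a simplified illustration, not the Bristol implementation; the eigenvalues are made up):

```python
from itertools import cycle, islice

def periodic_spectrum(eigenvalues, n):
    """Repeat a smaller matrix's eigenvalues cyclically up to length n."""
    return list(islice(cycle(eigenvalues), n))

# A hypothetical 3-eigenvalue spectrum padded to N = 5
padded = periodic_spectrum([0.9, 0.5, 0.1], 5)
```

Here `padded` becomes `[0.9, 0.5, 0.1, 0.9, 0.5]`, so spectra of different-size matrices in an MME can be compared on equal footing.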

Appendix: Practical Python Example

Complexity measure for deep architectures and random matrix ensembles: cPSE.cpse_measure_vanilla. The Python package Bristol (>= v0.2.12) now has support for computing cPSE from a list of matrices; there is no need to put things in torch model format by default.


!pip install bristol==0.2.12


An example case:

from bristol import cPSE
import numpy as np

np.random.seed(42)
# ten random 64x64 matrices standing in for layer matrices
matrices = [np.random.normal(size=(64, 64)) for _ in range(10)]
(d_layers, cpse) = cPSE.cpse_measure_vanilla(matrices)


d_layers is a decreasing vector; it will saturate at some point, and that point is where adding more layers won't improve the performance. This is a data-, learning- and architecture-independent measure.

Only a French word can explain the excitement here: Voilà!





Sunday, 7 March 2021

A critical look at why deployed machine learning model performance degrades quickly

Illustration of William of Ockham (Wikipedia)
One of the major problems in using a so-called machine learning model, usually a supervised model, in so-called deployment, meaning it will serve new data points which were not in the training or test set, is that, with great astonishment, modellers or data scientists observe that the model's performance degrades quickly, or it does not perform as well as its test-set performance. We earlier ruled out underspecification as the main cause. Here we propose that the primary reason for such performance degradation lies in relying solely on the hold-out method to judge generalised performance.

Why does model test performance not reflect in deployment? Understanding overfitting

A major contributing factor is the inaccurate meme of overfitting, which actually means overtraining, and the erroneous connection of overtraining solely to generalisation. This was discussed earlier here as understanding overfitting. Overfitting is not about how good the function approximation is compared to how the same "model" works on other subsets of the dataset. Hence, the hold-out method (test/train) of measuring performance does not provide sufficient and necessary conditions to judge a model's generalisation ability: with this approach we can detect neither overfitting (in the Occam's razor sense) nor the deployment performance. 

How to mimic deployment performance?

This depends on the use case, but the most promising approaches lie in adaptive analysis and in detecting distribution shifts and building models accordingly. However, the answer to this question is still an open research problem.
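As a hedged sketch of one such approach, a two-sample Kolmogorov-Smirnov check can flag a distribution shift between a training feature and incoming serving data (the feature, the shift, and the alert threshold below are all hypothetical, for illustration only):

```python
import numpy as np

rng = np.random.default_rng(0)

def ks_statistic(a, b):
    """Two-sample Kolmogorov-Smirnov statistic: the maximum gap
    between the two empirical CDFs."""
    grid = np.sort(np.concatenate([a, b]))
    cdf_a = np.searchsorted(np.sort(a), grid, side="right") / len(a)
    cdf_b = np.searchsorted(np.sort(b), grid, side="right") / len(b)
    return np.max(np.abs(cdf_a - cdf_b))

# Hypothetical single feature: training sample vs. a shifted serving sample
train_feature = rng.normal(0.0, 1.0, size=2000)
serve_feature = rng.normal(0.5, 1.0, size=2000)

shift_detected = ks_statistic(train_feature, serve_feature) > 0.05
```

Monitoring such a statistic per feature over serving windows gives one concrete handle on when deployment data has drifted away from the training distribution.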

Sunday, 19 February 2012

Gaussians in n-dimensions

New algorithms in the Gaussian Mixture Model (GMM) family may sound quite oxymoronic for analysing n-dimensional data sets, but considering the principle of parsimony, a GMM may be the chosen one among mixture models.
(c) Copyright 2008-2024 Mehmet Suzen (suzen at acm dot org)

Creative Commons License
This work is licensed under a Creative Commons Attribution 4.0 International License.