Scientific Memo

Why the simplest explanation is always the best

2025-11-16T10:35:00.000-08:00

Preamble

The simplest explanation is always the best among the explanations that are representative. It is called the principle of parsimony, Occam's razor. It is the bedrock of scientific enlightenment. There is a recent development that people start to do a category error to discard this principle based on the following setting that lead to misunderstanding: if we say that the simplest model that can explain the data in one representation but another representation requires a more complex model, then we are choosing more complex model. This is obviously wrong. The simplest explanation is to be chosen from the models that captures the complexity on the given representations, not the simplest over both representations. Filter explanations first based on representations then selection follows. Here we shown the core idea via an illustrative example.

Figure: Circle has a zero Pearson
correlation. (Wikipedia)

Revisiting Occam’s Razor: A case of correlation and geometry

In order to understand this category error, we will work on a concrete example. Let's say we have a data, which is a circular shape $\mathscr{D}(x, y)$, as in figure. We have 3 models:

$$\mathscr{M}_{1} : y = a x + b $$

$$\mathscr{M}_{2} : y = \sqrt{1-x^{2}} $$

$$\mathscr{M}_{3} : y ~ NN(x) $$

And $\mathscr{U}$ a utility function, Pearson correlation $C(x, y)$. We consider performance measure in ranking for the utility and a measure of representation. $\mathscr{M}_{3}$ is a neural network with a lot of parameters.

There is no error in choosing $\mathscr{M}_{1}$ based on the similar correlations these models produce. Principle of parsimony is not violated at all. This is correct if we are only considering numerical representation. Pearson correlation as a utility works well for purely numerical representation. What about geometric representation? Then we need to change our utility function (representation measure or performance function).

Let's say if we use curvature as a utility, $\kappa(x,y)$, in this case $\mathscr{M}_{1}$ fails to capture curvature and its is filtered out before Occam's razor can be applied. Then we left with $\mathscr{M}_{2}$ and $\mathscr{M}_{3}$ .

Correlation and geometric explanations are two different things. Two vastly different geometries can produce the same correlations. A model can be quite good in explaining correlation but fails to capture geometric complexity. In this setting, it does not mean that Occam's razor is wrong. We need to apply Occam's razor across representations: numeric, geometric, algebraic, or symbolic, depending on the purpose. Always keep in mind the purpose or utility of model when invoking Occam's razor. Simplest explanation over the required representations that are relevant are the best.

On the utility, performance and representations measure

Performance function, representation measure and utility functions can be different in real life, here for illustration purposes we consider them interchangeably.

Conclusion

Nature minimises cost over complexity but under utility constraints. Minimal cost without satisfying utility or representation measure won't be chosen despite being the simplest. We need to filter first based on utility or representation measure before applying Occam's razor. The simplest explanation is always the best among the explanations that are representative.

Cite as

@misc{suzen25occam,

title = { Why the simplest explanation is always the best},

howpublished = {\url{https://science-memo.blogspot.com/2025/11/simplest-explanation-always-best.html}},

author = {Mehmet Süzen},

year = {2025}

}

Compressive algorithmic randomness: Gibbs-randomness proposition for massively energy efficient deep learning

2025-06-03T06:15:00.000-07:00

Figure: Dual Tomographic Compression
Performance, Süzen, 2025.
Preamble

Randomness is elusive and its probably one of the outstanding concepts for human scientific endeavour, along with gravity. Kolmogorov complexity, appears to be so novel in trying to answering "what is randomness?". The idea that the length of the smallest model that can generate the random sequence determines its complexity was a turning point in history of science. Similarly, it implies choosing the simplest model for explaining a phenomenon. That's why Kolmogorov's work was also supported by the ideas of Solomonoff and Chaitin. A recent work, explores this algorithmic information from compression perspective with Gibbs entropy.

A strange tale of path from applied research to fundamental proposition.

During study of model compression algorithm development, I have noticed an amazing behaviour that information, entropy and compression over compression process have a more in depth.

New concepts in compression and randomness via train-compress cycles

Here, we explain the new concepts for both deep learning model compression and on the interplay between compression and algorithmic randomness.

Inverse compressed sensing (iCS): Normally CS procedure is applied to reconstruct an unknown signal with fewer measurements. In the case of deep learning train-compress, weights are known at one point in the training cycle. If we create hypothetical measurements, using CS formulations, we can reconstruct weights sparse projection.

Dual Tomographic Compression: Applying iCS for the input and output of neuronal level, layer-wise, simultaneously.

Weight rays: An output reconstructed vector out of DTC; weights given sparsity level, though they are not generated in isolation but within train-compress cycle.

Gibbs randomness proposition : An extension of Kolmogorov complexity for a compression process. That, directed randomness is the same as complexity reduction, i.e., compression.

Conclusion

A new technique called DTC can be used to train deep learning with model compression on the fly. This gives rise to massively energy efficient deep learning, reaching almost ~98% reduction in energy use. Moreover, the technique also demonstrated an extended version of Kolmogorov complexity.

Ultimate physical limit of data storage: Connecting Bekenstein Bound to Landauer Principle

2025-04-22T20:27:00.000-07:00

Preamble
Bekenstein's Information
(Wikipedia)

Black holes are not too esoteric anymore after LIGO's success and successful imaging efforts. Their entropy behaves much different than the entropy of an ordinary matter. This leads to incredible discovery of so called holographic principle. The principle stating that we live on a projection of higher-dimensional manifestation of universe. On the other hand, Landauer made silently a discovery on energy expenditure of keeping information processing reversible. Similar bound put forward by Bekenstein for black holes made Landauer-Bekenstein bound for quantum gravity a natural avenue to study.

What is the Bekenstein Bound?

Basically, this puts limits the size of a black hole, via its entropy bound. Essentially it states that, entropy of a black hole $S$ is bounded with the radius of the black hole $R$ and Energy $E$, other constants being 1,

$$S \le R \dot E$$

What is Landauer Principle?

A limit on any process wants to delete 1 bit of information, has to dissipate energy proportional to its temperature,

$$ E \ge T ln 2$$

Again we made the constants to 1.

Physical limit of data storage: 1 BekensteinBytes

Using both Bekenstein bound and Landauer principle one can compute the a physical limit for a data storage on a unit sphere,

$$S \sim T$$

If we scaled this with inverse of Planck area $\ell_{p}$; One bit of information proportional to Planck area and maximum attainable temperature $10^{32}$ K is scaled with this. The final value corresponds to $\approx 10^{100}$ bits. This again corresponds to about 10 Giga-Quetta-Quatta-Quatta bytes (1 Quetta byte is $10^{30}$ bytes).

At this point in time, we can propose that $10^{100}$, would be to call 1 BekensteinBytes. However with more fine-grain computations, the number may change. Our purpose here is to give a very rough idea about the scale of ultimate physical limitations of data storage in BekensteinBytes.

Bekenstein Information Conjecture : One cannot compress more than 1 BekensteinBytes on a smallest patch of space.

Outlook

An interesting connections in quantum gravity and computation, provides certain physical limitations on how much information we can be stored at a given unit space. We called this Bekenstein Information Conjecture.

Further reading

Insights into Bekenstein entropy with an intuitive mathematical definitions: A look into Thermodynamics of Black-holes (2023)
Bekenstein Bound (Wikipedia)
Landauer's Principle (Wikipedia)
On the relation between Bekenstein entropy and Landauer’s principle, D. Song (2024) DOI
Planck Scales (Wikipedia)
Metric Prefixes (Wikipedia)

Cite as

@misc{suezen25datalimit,

title = {Ultimate physical limit of data storage: Connecting Bekenstein Bound to Landauer Principle},

howpublished = {\url{https://science-memo.blogspot.com/2025/04/ultimate-physical-limit-of-data-storage.html}},

author = {Mehmet Süzen},

year = {2025}

}

Basic understanding of a metric tensor: Disassemble the concept of a distance over Riemannian Manifolds

2024-04-26T20:00:00.000-07:00

Preamble

Gaussian Curvature (Wikipedia)

One of the core concepts in Physics is so called metric tensor. This object encodes any kind of geometry. Combined genius of Gauß, Riemann and their contemporaries lead to such a great idea, probably one of the achievements of human quantitative enlightenment. However, due to notational aspects and lack of obvious pedagogical introduction, making object elusive mathematically. Einstein's notation made this more accessible but still, it requires more explicit explanation. In this post we disassemble the definition of a distance over any geometry with an alternate notation.

Disassemble distance over a Riemannian Manifolds

The most general definition of a distance between two infinitesimal points on any geometry, or a fancy word for it, is a Riemannian Manifolds, is defined with the following definitions. Manifold is actually a sub-set of any geometry we are concerned with and Riemannian implies a generalisation.

Definition 1: (Points on a Manifold) Any two points are defined in general coordinates $X$ and $Y$, They are defined as row and column vectors respectively. In infinitesimal components. $X = [dx_{1}, dx_{2}, ..., dx_{n}]$ and $Y= [dx_{1}, dx_{2}, ..., dx_{m}]^{T}$.

Geometric description between two points are defined as tensor product, $\otimes$, that is to say we form a grid, on the geometry.

Definition 2: (A infinitesimal grid) A grid on a geometry formed by pairing up each point's components, i.e., a tensor product. This would be a matrix $P^{mn}$, (as in points), with the first row $(dx_{1} dx_{1},.....,dx_{1} dx_{n})$ and the last row $(dx_{m} dx_{1},.....,dx_{m} dx_{n})$.

Note that grid here is used as a pedagogical tool, points are actually leaves on the continuous manifold. Now, we want to compute the distance between grid-points, $dS^{n}$, then metric tensor come to a rescue

Definition 3: (Metric tensor) A metric tensor $G^{nm}$ describes a geometry of the manifold that connects the infinitesimal grid to distance, such that $dS^{n}=G^{nm} P^{mn}$.

Note that, these definition can be extended to higher-dimensions, as in coordinates are not 1D anymore. We omit the square-root on the distance, as that''s also a specific to L2 distance. Here, we can think of $dS^{n}$ a distance vector, and $G^{nm}$ and $P^{mn}$ are the matrices.

Exercise: Euclidian Metric

A familiar geometry, metric for Euclidian space, reads diagonal elements of all 1 and rest of the elements zero. How the above definitions holds, left as an exercise.

Conclusion

We have discuss that a metric tensor, contrary to its name, it isn't a metric per se but an object that describes a geometry, having magic ability to connecting grids to distances.

Further reading

There are a lot of excellent books out there but a classic Spivak's differential geometry is recommended.

Please cite as follows:

@misc{suezen24bumt,

title = {Basic understanding of a metric tensor: Disassemble the concept of a distance over Riemannian Manifolds},

howpublished = {\url{https://science-memo.blogspot.com/2024/04/metric-tensor-basic.html},

author = {Mehmet Süzen},

year = {2024}

}

Mathematical Definition of Heuristic Causal Inference: What differentiates DAGs and do-calculus?

2023-11-10T09:54:00.000-08:00

Preamble

David Hume (Wikipedia)

Experimental design is not a new concept and randomised control trials (RCTs) are our solid gold standard of doing quantitative research, when no apparent physical laws are available to validate observations. However, it is very expensive to design RCTs, not ethical or either not possible due to logistical reasons in some cases. Then we fall into Causal Inference's heuristic frameworks, such as potential outcomes, matching, and time-series interventions in imagining counterfactuals and interventions. These methods provide immensely successful toolbox for quantitative scientist where by systems do not have any known physical laws. DAGs and do-calculus, differentiates from all these approaches that try to move away from full heuristics. In this post we try to postulate this formally in mathematical terms in the context of causal inference over observational data framework. We established that DAGs and do-calculus bring mathematically more principled way of practicing causal inference akin to theoretical physics attitude.

Definition of Heuristic Causal Inference (HeuristicCI) : Observational Data

Heuristics in general implies an algorithmic approximate solution, usually appear as numerical and statistical algorithms in causal inference whereby full RCT is not available. This can be formalised as follows,

Definition (HeuristicCI) Given dataset of $n-$dimensions $\mathscr{D} \in \mathbb{R}^{n}$ observation, having variates of $X=x_{i}$, with each having different sub-sets (categories within $x_{i}$), having at least one category of observations. We want to test causal connection between two distinct subsets of $X$, $\mathscr{S}_{1} , \mathscr{S}_{2}$, given an interventional versions or imagined counterfactual where by at least one of the subset is available, $\mathscr{S}_{1}^{int} , \mathscr{S}_{2}^{int}$. Using an algorithm $\mathscr{A}$ that processes dataset to test an effect size $\delta$ using a statistic $\beta$, as follows, $$ \delta= \beta(\mathscr{S}_{1} , \mathscr{S}_{1}^{int})-\beta(\mathscr{S}_{2} , \mathscr{S}_{2}^{int})$$ statistic $\beta$ can be result of a machine learning procedure as well and difference in $\delta$ is only a particular choice, i.e., such as Average Treatment Effect (ATE). The algorithm $\mathscr{A}$ is called HeuristicCI.

Many of the non-DAGs and do-calculus methods directly falls into this category, such as potential outcomes, uplift, matching and synthetic controls. This definition could be quite obvious to practitioners that has a good handle in mathematical definitions. Moreover, HeuristicCI implies solely data-driven approach to causality inline with Hume's pure-empirical view-point.

Primary distinction in practicing DAGs that it brings causal ordering naturally [suezen23pco] with scientist's cognitive process encoded, where by HeuristicCI search for statistical effect size that has a causal component in fully data-driven way. However, a HybridCI would entails using DAGs and do-calculus in connection with data driven approaches.

Conclusion

In this short exposition, we introduced HeuristicCI concept that category of methods that do not use DAGs and do-calculus explicitly in causal inference practice. However, we do not put a well designed RCTs in this category. Because, as a gold standard approach whereby properly encoded experimental design generates full interventional data reflecting scientist's domain knowledge.

References and Further reading

Looper repo : A resource list for causality in statistics, data science and physics
[suezen23pco] Practical Causal Ordering: Why weighted DAGs are powerful for causal inference?
Related Wikipedia articles

Please cite as follows:

@misc{suezen23hci,

title = {Mathematical Definition of Heuristic Causal Inference: What differantiates DAGs and do-calculus?},

howpublished = {\url{https://science-memo.blogspot.com/2023/11/heuristic-causal-inference.html},

author = {Mehmet Süzen},

year = {2023}

}

Postscript A: Why Pearlian Causal Inference is very significant progress for empirical science?

Judea Pearl's framework for causality sometimes referred to as “mathematisation of causality”. However, “axiomatic foundations of causal inference” is fair identification, Pearl's contribution to the field is in par with Kolmogorov's axiomatic foundations of probability. Key papers of this axiomatic foundations are published in 1993 (back-doors) [1] and 1995 (do-calculus) [2].

Original works of Axiomatic foundation for causal inference:

[1] Pearl, J., “Graphical models, causality, and intervention,” Statistical Science, Vol. 8, pp. 266–269, 1993.

[2] Pearl, J., “Causal diagrams for empirical research,” Biometrika, Vol. 82, Num. 4, pp. 669–710, 1995.

Resolution of misconception of overfitting: Differentiating learning curves from Occam curves

2023-04-01T09:30:00.007-07:00

Preamble

Occam (Wikipedia)

A misconception that overfitted model can be identified with the amount of generalisation gap between model's training and test sets over its learning curves is still out there. Even in some prominent online lectures and blog posts, this misconception is now repeated without critical look. In general, this practice unfortunately diffuse into some academic papers and industrial, practitioners attribute poor generalisation to overfitting. We have provided a resolution of this via a new conceptual identification of complexity plots, so called Occam's curves differentiating from a learning curve. An accessible mathematical definitions here will clarify the resolution of the confusion.

Learning Curve Setting: Generalisation Gap

Learning curves explain how a given algorithm's generalisation improves over time or experience, originating from Ebbinghaus's work on human memory. We use inductive bias to express a model, as model can manifest itself in different forms from differential equations to deep learning.

Definition: Given inductive bias $\mathscr{M}$ formed by $n$ datasets with monotonically increasing sizes $\mathbb{T} = \{|\mathbb{T}_{0}| > |\mathbb{T}_{1}| > ...> |\mathbb{T}_{n}| \}$. A learning curve $\mathscr{L}$ for $\mathscr{M}$ is expressed by the performance measure of the model over datasets, $\mathbb{p} = \{ p_{0}, p_{1}, ... p_{n} \}$, hence $\mathscr{L}$ is a curve on the plane of $(\mathbb{T}, p)$.

By this definition, we deduce that $\mathscr{M}$ learns if $\mathscr{L}$ increases monotonically.

A generalisation gap is defined as follows.

Definition: Generalisation gap for inductive bias $\mathscr{M}$ is the difference between its' learning curve $\mathscr{L}$ and the learning curve of the unseen datasets, i.e., so-called training, $\mathscr{L}^{train}$. The difference can be simple difference, or a measure differentiating the gap.

We conjecture the following.

Conjecture: Generalisation gap can't identify if $\mathscr{M}$ is an overfitted model. Overfitting is about Occam's razor, and requires a pairwise comparison between two inductive biases of different complexities.

As conjecture suggests that generalisation gap is not about overfitting, despite the common misconception. Then, why the misconception? The misconception lies on the confusion of how to produce the curve that we could judge overfitting.

Occam Curves: Overfitting Gap [Occam's Gap]

In the case of generating Occam curves, a complexity measure $\mathscr{C}$ over different inductive biases $\mathscr{M_{i}}$ plays a role. Then the definition reads.

Definition: Given $m$ inductive bias $\mathscr{M_{i}}$ formed by $n$ datasets with monotonically increasing sizes $\mathbb{T} = \{|\mathbb{T}_{0}| > |\mathbb{T}_{1}| > ...> |\mathbb{T}_{n}| \}$. An Occam curve $\mathscr{O}$ for a given $\mathscr{M}$ is expressed by the performance measure of the model over complexity-dataset size points $\mathbb{F} = [(|\mathbb{T}_{0}|, \mathscr{C}), (|\mathbb{T}_{1}| , \mathscr{C}), ..., (|\mathbb{T}_{n}| , \mathscr{C}) ] $; Performance of a given inductive bias reads $\mathbb{p} = \{ p_{0}, p_{1}, ... p_{n} \}$, hence Occam curve, $\mathscr{O}$ is a curve on the plane of $(\mathbb{F}, p)$.

Given definition, producing Occam curves are more complicated than simply plotting test and train curves over batches. The ordering in $\mathbb{F}$ forms what is so-called goodness of rank.

Summary and take home

Resolution of misconception of overfitting lies in producing Occam curves to judge the bias-variance tradeoff, not the learning curves of a single model.

Further reading & notes

Further posts and a glossary : The concept of overgeneralisation and goodness of rank.
Double decent phenomenon, it uses Occam's curves, not learning curves.
We use dataset size as an interpretation of increasing experience, there could be other ways of expressing a gained experience, but we take the most obvious evidence.

Please cite as follows:

@misc{suezen23rmo,

title = {Resolution of misconception of overfitting: Differentiating learning curves from Occam curves},

howpublished = {\url{https://science-memo.blogspot.com/2023/04/Occam-curves.html}},

author = {Mehmet Süzen},

year = {2023}

}

Postscript notes

Take home messages

Understanding Generalisation Gap and Occam’s gap

Model selection and evaluations are usually confused by novice and as well as experienced data scientists and professionals doing modelling. There are a lot of misconceptions in the literature, but in practice primary take home messages can be summarised as follows:

1. What is a model? A model is an “inductive bias” of the modeller, a selected parametrised functions for example, a neural network architecture choice. Contrary to many, specific parametrisation of a model (deep learning architecture) is not a different model.
2. A model’s test and training performance difference is about generalisation gap. Overfitting and under-fitting is not about generalisation gap.
3. Overfitting or under-fitting is a comparison problem: How a model deviates from a reference model? This is called Occam’s gap or so called model selection error.
4. Occam’s gap generalises Empirical Risk minimisation over a learning curve. Empirical risk minimisation itself is not about learning.

How and when a model generalises well and generalisation of empirical risk minimisation are currently an open research topics.

Loschimidt's Paradox and Causality: Can we establish Pearlian expression for Boltzmann's H-theorem?

2023-02-25T11:19:00.003-08:00

Boltzmann (Wikipedia)
Post covers the papers: H-theorem do-conjecture, M. Süzen, arxiv:2310.01458 (2023)

Preamble

Probably the most important achievement for humans is the ability to produce scientific discoveries, that helps us objectively understand how nature works and build artificial tools where no other species can. Entropy is an elusive concept and one of the crown achievements of human race. We question here if causal inference and Loschmidt's paradox can be reconciled.

Mimicking analogies are not physical

Before even try to understand what is a physical entropy, we should make sure that there is only one kind of physical entropy from thermodynamics, formulated by Gibbs-Boltzmann ($S_{G}$ and $S_{B}$). Other entropies such as Shannon's information entropy are all analogies to physics, and mimicking concepts.

Why counting microstates are associated with time?

The following definition of entropy is due to Boltzmann but Gibbs' formulation tend to provide equivalence, technically different formulations aside, they are actually equivalent.

Definition 1: An entropy of a macroscopic material is associated with larger number of states its constituted elements take different states, $\Omega$. This is associated with $S_{B}$, Boltzmann's entropy.

Now, as we know from basic thermodynamics classes that entropy change of a system can not decrease, so the time's arrow.

Definition 2: Time's arrow is identified with change in entropy of material systems, that $\delta S \ge 0$.

We put aside the distinction between open and close systems and equilibrium and non-equilibrium dynamics, but concentrate on how come counting system's state's are associated with time's arrow?

Loschimidt's Paradox: Irreversible occupancy on discrete states and causal inference

The core idea probably can be explained via discrete lattice and occupancy on them over chain of dynamics.

Conjecture 1: Occupancy of $N$ items on $M$ discrete states, $M>N$, evolving with dynamical rules $\mathscr{D}$ necessarily increases $\Omega$, compare to the number of sampling if it were $M=N$.

This conjecture might explain the entropy increase, but irreversibility of the dynamical rule $\mathscr{D}$ is required addressing Loschimidt's Paradox, i.e., how to generate irreversible evolution given time-reversal dynamics. Actually, do-calculus may provide a language to resolve this, by inducing interventional notation on Boltzmann's H-theorem with Pearlian view. The full definition of H-function is a bit more involved, but here we summarise it in condensed form with a do operator version of it.

Conjecture 2 (H-Theorem do-conjecture): Boltzmann's H-function provides a basis for entropy increase, it is associated with conditional probability of a system $\mathscr{S}$ being in state $X$ on ensemble $\mathscr{E}$. Hence, $P(X|\mathscr{E})$. Then, an irreversible evolution from time-reversal dynamics should use interventional notation, $P(X|do(\mathscr{E}))$. Then information on how time reversal dynamics leads to time's arrow encoded on, how dynamics provides an interventional ensembles, $do(\mathscr{E})$.

Conclusion

We provided some hints on why would counting states lead to time's arrow, an irreversible dynamics. In the light of the development of mathematical language for causal inference in statistics, the concepts are converging. Along with understanding Loschmidt's Paradox via do-calculus, it can establish an asymmetric notation. Loschmidt's question is long standing problem in physics and philosophy with great practical implications in different physical sciences.

Further reading

Loschimid's Paradox
H-Theorem
do-Calculus revisited, J. Pearl (2012) pdf
Causal Inference : Looper Repository for collection of resources.
H-theorem do-conjecture, M. Süzen, arxiv:2310.01458 (2023)

Please cite as follows:

@misc{suezen23lpc,

title = {Loschimidt's Paradox and Causality: Can we establish Pearlian expression for Bolztmann's H-theorem?},

howpublished = {\url{https://science-memo.blogspot.com/2023/02/loschimidts-do-calculus.html}},

author = {Mehmet Süzen},

year = {2023}

}

@article{suzen23htd,

title={H-theorem do-conjecture},

author={Mehmet Süzen},

preprint={arXiv:2310.01458},

url = {https://arxiv.org/abs/2310.01458}

year={2023}

}

Insights into Bekenstein entropy with an intuitive mathematical definitions: A look into Thermodynamics of Black-holes

2023-02-18T09:46:00.002-08:00

Jacob Bekenstein
(Wikipedia)

Preamble

Thermodynamics of black holes has appeared as one of the most interesting areas of research in theoretical physics [Wald1994], specially after LIGO's massive success. The striking results of Jacob Bekenstein [Bekenstein1973] in proposing a formulation of entropy for a black hole was on of the most striking turning point in building explanations for the thermodynamics of gravitational systems. Bekenstein entropy is defined to be so-called a phenomenological relationship and surprisingly easy to understand concept using basic dimensionality analysis. In this post, we will show how to understand the entropy of a black hole just using basic dimensionality analysis, fundamental physics constants and basic definition of entropy.

Dimensions and scales

Dimensionality analysis appears in many different areas of physics and engineering, from fluid dynamics to relativity. The starting point is to understand the concept of dimensions. Every quantity we measure in real life has a dimension. It means a quantity $\mathscr{Q}$ we obtain from a measurement $\mathscr{M}$ has a numeric value $v$ and associated unit $u$. $\mathscr{Q}=\langle v, u \rangle$ given $\mathscr{M}$. There are 3 distinct fundamental unit types length (L), time (T) and mass (M).

Intuitive Bekenstein entropy (BE) for a black hole : Informal mathematical definition

Black holes are astronomical objects that are not directly observable due to their mass condensed in a small area. The primary object we will use is something called Planck length $L_{p}$ and it implies physically possible smallest patch of the space-time, this is associated with the state of the black holes on their horizon. We won't define the Planck length here in detail but with the knowledge of fundamental physics constants and dimensional analysis we mentioned, one can get a constant value for this length.

Definition: Finite entropy $S_{f}$ of an object is associated with the number of states $\Omega$ a system can attain.

If we combine this definition for a black hole entropy :

Definition Finite entropy of a black-hole $S_{f}^{BH}$ is associated with the number of its states $\Omega$, number of elements on it's surface area of $A$. The elements are discretised with small patches $a_{p}=L_{p}^{2}$. Then intuitively, $\Omega$ yields to $A$ divided by $a_{p}$.

Bekenstein entropy is not thermodynamic entropy alone and family of Bekenstein entropies

The unit analysis tells us that $A$ has a dimension of length square. We intentionally omit any equality in the above definition upon $S_{f}^{BH}$ because, in practice Bekenstein Entropy is not thermodynamic entropy alone. The formulation usually presented as BE in general uses equality for the above approach. However this is not strictly thermodynamical alone, that's why we specify definitions as finite entropy and only express the relationship as association. Similarly any other constants as it can yield to different Bekenstein entropies such as introduction of new constants would yield to family of Bekenstein entropies.

Why surface area defines states of a black-hole?

This is an amazing question and Bekenstein's main contribution is to associate this to number of states of a black-hole on event horizon, i.e., point of of no return layer whereby ordinary matter can't return. The justification is that all other properties of black hole defines this surface. Here is the intuitive definition of states of black-hole.

Definition A surface area $\mathscr{A}$ is formed by the set of physical properties forming an ensembles. such as charge density, angular momentum. These ensembles indirectly samples thermodynamics ensembles.

Even though intuition is there, this question might still be an open question further.

Conclusion

We provided the primary idea that Bekenstein tried to convey in his 1973 paper intuitively. However, we identify its thermodynamic limit is an open research area. Thermodynamic limit implies that taking infinite limit of both area and the discretised areas, even though it sounds that the values might converge to infinity, simultaneous limit would converge to a finite value for a physical matter.

Primary Papers

Bekenstein J.D.: Lettere al Nuovo Cimento, 4, 737, (1972)
Bekenstein J.D.: Physical Review D, 7, 2333, (1973)
Bekenstein J.D.: Physical Review D, 9, 3292 (1974)
Bekenstein J.D.: Physical Review D, 12, 3077 (1975)

Primary Book

Wald, Quantum field theory in curved space times (1994)

Please cite as follows:

@misc{suezen23ibe,

title = {Insights into Bekenstein entropy with an intuitive mathematical definitions},

howpublished = {\url{https://science-memo.blogspot.com/2023/02/bekenstein-entropy.html},

author = {Mehmet Süzen},

year = {2023}

}

Postscript A:

Information can’t be destroyed

Proposals of that information is destroyed out of thin air is a red flag for any physical theory: this includes theories on evaporating black holes. Bekenstein’s insight in this direction that surface area is associated with entropy. The black-holes’ information in this context is quite different than the Shannon’s entropy. An evaporating black-hole, the area approaching to zero is not the same as information going to zero, surface area is a function of physical properties of the stellar object that bound by conservation laws in their interaction with their surrounding. Hence, the information is preserved even if area goes to zero.

Postscript B:

What is Holographic principle? its origins from Bekenstein Entropy perspective

The word embedding applies in this context as well. Embedding implies some sort of dimensionality projection. A projection to lower dimensional space, or on the other end, to the higher dimensional space. Holography is no different. Imagine taking 2D snap shots of rotating 3D objects, generating this in reverse is the end effect of holographic reconstruction. N-dimension to (N-1) projection. This is the bases of holographic principle: entropy of black-holes doesn’t appear as all states of its constituted matter, as normally should have for ordinary matter, it manifest as N-1 projection on it’s surface. This kind of holographic entropy is first noted by Bekenstein; whereby he assigned the event-horizon area as a representation of the states of the black-hole volume. This projection to (N-1)-dimension is improved upon Bekenstein’s approach to generalised situations in explaining how universe might be a hologram entirely by Gerard 't Hooft and Leonard Susskind. Holographic principle, probably one of the most important development in theoretical physics in recent times.

Misconceptions on non-temporal learning: When do machine learning models qualify as prediction systems?

2023-01-28T10:07:00.002-08:00

Preamble

Babylonian Tablet for
square root of 2.
(Wikipedia)

Prediction implies a mechanics, as in knowing a form of a trajectory over time. Strictly speaking a predictive system implies knowing a solution to the path, set of variable depending on time, time evolution of the system under consideration. Here, we define semi-informally how a prediction system is defined mathematically and show how non-temporal learning can be mapped into a prediction system.

Temporal learning : Recurrence, trajectory and sequences

A trajectory can be seen as a function of time, identified in recurrence manner. It means $x(t_{i})=f(x_{i-1})$. However, this is one of the possible definitions. The physical equivalent of this appears as a solution to ordinary differential equation, such as the velocity $v(t) = dx(t)/dt$, recurrence on its solution. On the other hand machine learning, an empirical approach is taken and a sequence data such as natural language or a log events occurring in sequence. Any modelling on such data is called temporal learning. This includes classical time-series algorithms, gated units in deep learning and differential equations.

Definition: A prediction system $\mathscr{F}$ that is build with data $D$ but utilised for a data that is not used in building it $D'$, qualified as such if both $D$ and $D'$ are temporal sets and output of the system is a horizon $\mathbb{H}$, that is a sequence.

Using non-temporal supervised learning is interpolation or extrapolation

Often practice in industry to turn temporal interactions into flat set of data vectors, $v_{i}$, $i$ corresponds to a time point or an arbitrary property of the dataset resulting in breaking the temporal associations and causal links. This could also manifest as set of images with some labels which has no ordering or associational property in the dataset. Even though our system build upon these non-temporal datasets, indeed it constituted a learning systems as interpolation or extrapolation. Their utility in using them for $D'$, strictly speaking does not qualify as prediction systems.

Mapping with pre-processing

A mapping indeed possible from non-temporal data to a temporal one, if their original form is not in temporal form yet. This is been studied in complexity literature. This requires an algorithm to map flattened data vectors we mentioned into a sequence data.

Mapping with Causality

A distinct models from causal inference are qualified as predictive systems even if they are trained on non-temporal data, because causality establishes a temporal learning.

Non-temporal modals: Do they still learn?

Even though, we exclude non-temporal model utilisation as non-predictive systems, they still classified as learned models. Because their outputs are generated by a learning procedure.

Conclusion

Differentiation among temporal and non-temporal learning is provided in associational manner. This results into definition of a prediction system, that excludes non-temporal machine learning models: such as models for unlinked set of vectors, a set of numbers mapped from any data modality.

Further reading & postscript notes

Practice causal inference: Conventional supervised learning can't do inference
Causal inference : Editor's selections from the looper repo.
Causal models usually are not train but validated or so called discovered.

The concept of overgeneralisation and goodness of rank : Overfitting is not about comparing training and test learning curves

2022-12-20T10:52:00.003-08:00

Preamble

Walt Disney Hall,
Los Angeles (Wikipedia)

Unfortunately, it is still thought in machine learning classes that overfitting can be detected by comparing training and test learning curves on the single model's performance. The origins of this misconception is unknown. Looks like an urban legend has been diffused into main practice and even in academic works the misconception taken granted. Overfitting's definition appeared to be inherently about comparing complexities of two (or more) models. Models manifest themself as inductive biases modeller or data scientist inform in their tasks. This makes overfitting in reality a Bayesian concept at its core. It is not about comparing training and test learning curves that if model is following a noise, but pairwise model comparison-testing procedure to select more plausable belief among our beliefs that has the least information: entities should not be multiplied beyond necessity i.e., Occam's razor. We introduce a new concept in clarifying this practically, goodness of rank to distinguish from well known goodness of fit, and clarify concepts and provide steps to attribute models with overfitted or under-fitted models.

Poorly generalised model : Overgeneralisation or under-generalisation

The practice that is described in machine learning classes, and practiced in industry that overfitting is about your model following training set closely but fails to generalised in test set. This is not overfitted model but a model that fails to generalise: a phenomena should be called Overgeneralisation (or under-generalisation).

A procedure to detect overfitted model : Goodness of rank

We have provided complexity based abstract description of model selection procedure, here as complexity ranking: we will repeat this procedure with identification of the overfilled model explicitly.

In the following steps a sketch of an algorithmic recipe for complexity ranking of inductive biases via informal steps, overfitted model identification explicitly:

Define a complexity measure $\mathscr{C}$($\mathscr{M}$) over an inductive bias.
Define a generalisation measure $\mathscr{G}$($\mathscr{M}$, $\mathscr{D}$) over and inductive bias and dataset.
Select a set of inductive biases, at least-two, $\mathscr{M}_{1}$ and $\mathscr{M}_{2}$.
Produce complexity and generalisation measures on ($\mathscr{M}$, $\mathscr{D}$): Here for two inductive biases: $\mathscr{C}_{1}$, $\mathscr{C}_{2}$, $\mathscr{G}_{1}$, $\mathscr{G}_{2}$.
Ranking of $\mathscr{M}_{1}$ and $\mathscr{M}_{2}$: $argmax \{ \mathscr{G}_{1}, \mathscr{G}_{2}\}$ and $argmin \{ \mathscr{C}_{1}, \mathscr{C}_{2}\}$
$\mathscr{M}_{1}$ is an overfitted model compare to $\mathscr{M}_{2}$ if $\mathscr{G}_{1} <= \mathscr{G}_{2}$ and $\mathscr{C}_{1} \gt \mathscr{C}_{2}$.
$\mathscr{M}_{2}$ is an overfitted model compare to $\mathscr{M}_{1}$ if $\mathscr{G}_{2} <= \mathscr{G}_{1}$ and $\mathscr{C}_{2} \gt \mathscr{C}_{1}$.
$\mathscr{M}_{1}$ is an underfitted model compare to $\mathscr{M}_{2}$ if $\mathscr{G}_{1} < \mathscr{G}_{2}$ and $\mathscr{C}_{1} < \mathscr{C}_{2}$.
$\mathscr{M}_{2}$ is an underfitted model compare to $\mathscr{M}_{1}$ if $\mathscr{G}_{2} < \mathscr{G}_{1}$ and $\mathscr{C}_{2} < \mathscr{C}_{1}$.

If two model has the same complexity, then much better generalised model should be selected, in this case we can't conclude that either model is overfitted but generalised differently. Remembering that overfitting is about complexity ranking : Goodness of rank.

But overgeneralisation sounded like overfitting, isn't it?

Operationally overgeneralisation and overfitting implies two different things. Overgeneralisation operationally can be detected with a single model. Because, we can measure the generalisation performance of the model alone with data, in statistical literature this is called goodness of fit. Moreover overgeneralisation can also be called under-generalisation, as both implies poor generalisation performance.

However, overfitting implies a model overly performed compare to an other model i.e., model overfits but compare to what? Practically speaking, overgeneralisation can be detected via holdout method, but not overfitting. Overfitting goes beyond goodness of fit to goodness of rank as we provided recipe as pairwise model comparison.

Conclusion

The practice of comparing training and test learning curves for overfitting diffused into machine learning so deeply, the concept is almost always thought a bit in a fuzzy-way, even in distinguished lectures explicitly. Older textbooks and papers correctly identifies overfitting as comparison problem. As a practitioner, if we bear in mind that overfitting is about complexity ranking and it requires more than one model or inductive bias in order to be identified, then we are in better shape in selecting better model. Overfitting can not be detected via data alone on a single model.

Further reading

Some of the posts, reverse chronological order, that this blog have tried to convey what overfitting entails and its general implications.

Glossary

To make things clear, we provide concept definitions.

Generalisation A concept that if model can perform as good as the data it has not seen before, however seen here is a bit vague, it could have seen data points that are close to the data would be better suited in the context of supervised learning as oppose to compositional learning.

Goodness of fit An approach to check if model is generalised well.

Goodness of rank An approach to check if model is overfitted or under-fitted comparable to other models.

Holdout method A metod to build a model on the portion of available data and measure the goodness of fit on the holdout part of the data, i.e., test and train.

Inductive bias A set of assumptions data scientist made in building a representation of the real world, this manifest as model and the assumptions that come with a model.

Model A model is a biased view of the reality from data scientist. Usually appears as a function of observables $X$ and parameters $\Theta$, $f(X, \Theta)$. The different values of $\Theta$ do not constitute a different model. See also What is a statistical model?, Peter McCullagh

Occam's razor (Principle of parsimony) A principle that less complex explanation reflects reality better. Entities should not be multiplied beyond necessity.

Overgeneralisation (Under-generalisation) If we have a good performance on the training set but very bad performance on the test set, model said to overgeneralise or under-generalise; as a result of goodness of fit testing, i.e., comparing learning curves over test and train datasets.

Regularisation An approach to augment model to improve generalisation.

Postscript Notes

Note: Occam’s razor is a ranking problem: Generalisation is not

The holy grail of machine learning in practice is hold-out methods. We want to make sure that we don’t overgeneralise. However, a misconception has been propagated that overgeneralisation is mistakenly thought of as synonymous with overfitting. Overfitting has a different connotation as ranking different models rather than measuring the generalisation ability of a single model. The generalisation gap between training and test sets is not about Occam’s razor.

The conditional query fallacy: Applying Bayesian inference from discrete mathematics perspective

2022-12-05T10:16:00.001-08:00

Preamble

The Tilled Field,
Joan Miró
(Wikipedia)

One of the core concepts in data sciences is conditional probabilities, $p(x|y)$ appear as logical description of many of the tasks, such as formulating regression or as a core concept in Bayesian Inference. However, there is operationally no special meaning of a conditional or joint probabilities as their arguments are no more than a compositional event statements. This raise a question: Is there any fundamental relationship between Bayesian Inference and discrete mathematics that is practically relevant to us as practitioners? Since, both topics are based on discrete statements returning a Boolean variables. Unfortunately, the answer to this question is a rabbit hole and probably even an open research. There is no clearly established connections between discrete mathematics fundamentals and Bayesian Inference.

Statement mappings as definition of probability

Statement is a logical description of some events, or set of events. Let's have a semi-formal description of such statements.

Definition: A mathematical or logical statement formed with boolean relationships $\mathscr{R}$ (conjunctions) among set of events $\mathscr{E}$, so a statement $\mathbb{S}$ is formed with at least a tuple of $\langle \mathscr{R}, \mathscr{E} \rangle$.

Relationships can be any binary operator and events could explain anything perceptional, i.e., a discretised existence. This is the core discrete mathematics and almost all problems in this domain formed in this setting from defining functions to graph theory. A probability is no exception and definition naturally follows, as so called statement mapping.

Definition: A probability $\mathbb{P}$ is a statement mapping, $\mathbb{P}: \mathbb{S} \rightarrow [0,1]$.

The interpretation of this definitions that a logical statement is always True if probability is 1 and always False if it is 0. However, having conditionals based on this is not that clear cut.

Conditional Query Fallacy

A non-commutative statement may imply, reversing the order of statements should not yield to the same filtered set on the data for Bayesian Inference. However, Bayes' theorem would have a fallacy for statement mappings for conditionals in this sense.

Definition: The conditional query fallacy is defined as one can not update belief in probability, because reversing order of statements in conditional probabilities halts Bayes' update, i.e., back to back query results into the same dataset for inference.

At first glance, this appears as a Bayes' rule does not support commutative property, practically posterior being equal to likelihood. However, this fallacy appears to be a notational misdirection. Inference on the filtered dataset back to back constituting a conditional fallacy i.e., when a query language is used to filter data to get A|B and B|A yielding to the same dataset regardless of filtering order.

Although, in inference with data, likelihood is actually not a conditional probability, strictly speaking and not a filtering operation. It is merely a measure of update rule. We compute likelihood by multiplying values obtained by i.i.d. samples inserted into conjugate prior, a distribution is involved. Hence, the likelihood computationally is not really a reversal of conditional as in $P(A|B)$ written as reversed, $P(B|A)$.

Outlook

In computing conditional probabilities for Bayesian Inference, our primary assumption is that conditional probabilities; likelihood and posterior are not identical. Discrete mathematics only allows Bayesian updates, if time evolution is explicitly stated with non-commutative statements for conditionals.

Going back to our initial question, indeed there is a deep connection between the fundamentals of discrete mathematics and Bayesian belief update on events as logical statements. The fallacy sounds a trivial error in judgement but (un)fortunately goes into philosophical definitions of probability that simultaneous tracking of time and sample space is not encoded in any of the notations explicitly, making statement filtering definition of probability a bit shaky.

Glossary of concepts

Statement Mapping A given set of mathematical statements mapped into a domain of numbers.

Probability A statement mapping, where domain is $\mathscr{D} = [0,1]$.

Conditional query fallacy Differently put than the above definition. Thinking that two conditional probabilities as reversed statements of each other in Bayesian inference, yields to the same dataset regardless of time-ordering of the queries.

Notes and further reading

Fallacy is one computes $P(A|B)=P(B|A)$, while filtering results into identical datasets. Correction would be that, one needs to use different sample sizes for reverse statement or compute joints and marginals separately on their own filtered datasets. Use the first filtering sample size in computing the probability not the total.
Here, discrete mathematics we refer to appears within arguments of probability. The discussion of discrete parameter estimations are a different topic. Gelman discusses this, here.
Conjunction Fallacy
Probability Interpretations
Bayesian rabbit holes: Decoding conditional probability with non-commutative algebra M. Süzen (2022)
Holes in Bayesian Statistics Gelman-Yao (2021) : This is a beautifully written article. Specially, proposal that context dependence should be used instead of subjective
John Allen Paulos’s Innumeracy: Mathematical Illiteracy and Its Consequences,
Random House (1988). Interestingly Paullos mentions fallacies regarding the conditional probabilities.

An oldy book on numerical fallacies but goodie.

Differentiating ensembles and sample spaces: Alignment between statistical mechanics and probability theory

2022-11-15T12:28:00.000-08:00

Preamble

Sample space is the primary concept introduced in any probability and statistics books and in papers. However, there needs to be more clarity about what constitutes a sample space in general: there is no explicit distinction between the unique event set and the replica sets. The resolution of this ambiguity lies in the concept of an ensemble. The concept is first introduced by American theoretical physicist and engineer Gibbs in his book Elementary principle of statistical mechanics. The primary utility of an ensemble is a mathematical construction that differentiates between samples and how they would form extended objects.

In this direction, we provide the basics of constructing ensembles in a pedagogically accessible way from sample spaces that clears up a possible misconception. This usage of ensemble prevents the overuse of the term sample space for different things. We introduce some basic formal definitions.

Figure: Gibbs's book
introduced the concept of
ensemble (Wikipedia).

What Gibbs's had in mind by constructing statistical ensembles?

A statistical ensemble is a mathematical tool that connects statistical mechanics to thermodynamics. The concept lies in defining microscopic states for molecular dynamics; in statistics and probability, this corresponds to a set of events. Though these events are different at a microscopic level, they are sampled from a single thermodynamics ensemble, a representative of varying material properties or, in general, a set of independent random variables. In dynamics, micro-states samples an ensemble. This simple idea has helped Gibbs to build a mathematical formalism of statistical mechanics companion to Boltzmann's theories.

Differentiating sample space and ensemble in general

The primary confusion in probability theory on what constitutes a samples space is that there is no distinction between primitive events or events composed of primitive events. We call both sets sample space. This terminology easily overlooked in general as we concentrate on events set but not the primitive events set in solving practical problems.

Definition: A primitive event $\mathscr{e}$ implies a logically distinct unit of experimental realisation that has not composed of any other events.

Definition: A sample space $\mathscr{S}$ is a set formed by all $N$ distinct primitive events $\mathscr{e}_{i}$.

By this definition, regardless of how many fair coins are used or if a coin toss in a sequence for the experiment, the sample space is always ${H,T}$, because these are the most primitive distinct events a system can have, i.e., a single coin outcomes. However, the statistical ensemble can be different. For example for two fair coins or coin toss in sequence of length two, corresponding ensemble of system size two reads ${HH, TT, HT, TH}$. Then, the definition of ensemble follows.

Definition: An ensemble $\mathscr{E}$ is a set of ordered set of primitive events $\mathscr{e}_{i}$. These event sets can be sampled with replacement but order matters, i.e., $ \{e_{i}, e_{j} \} \ne \{e_{j}, e_{i} \}$, $i \ne j$.

Our two coin example's ensemble should be formally written as $\mathscr{E}=\{\{H,H\}, \{T,T\}, \{H,T\}, \{T,H\}\}$, as order matters members $HT$ and $TH$ are distinct. Obviously for a single toss ensemble and a sample space will be the same.

Ergodicity makes the need for differentiation much more clear : Time and ensemble averaging

The above distinction makes building time and ensemble averaging much easier. The term ensemble averaging is obvious as we know what would be the ensemble set and averaging over this set for a given observable. Time averaging then could be achieved by curating a much larger set by resampling with replacement from the ensemble. Note that the resulting time-average value would not be unique, as one can generate many different sample sets from the ensemble. However, bear in mind that the definition of how to measure convergence to ergodic regime is not unique.

Conclusion

Even though the distinction we made sounds very obscure, this alignment between statistical mechanics and probability theory may clarify the conception of ergodic regimes for general practitioners.

Further reading

Please Cite:

@misc{suezen22dess,

title = {Differentiating ensembles and sample spaces: Alignment between statistical mechanics and probability theory},

howpublished = {\url{https://science-memo.blogspot.com/2022/11/ensembles-probability-theory.html},

author = {Mehmet Süzen},

year = {2022}

}

Postscript

If there are multiple events coming from set of primitive events, compositional outcomes considered to be ensemble not sample space. Sample space is a set that we sample from, either one or multiple times to build an ensemble. Ensemble notion within pure ML context was also noticed by late David J. C. MacKay, in his book Information Theory, Inference and Learning, Cambridge University Press (2003).

Overfitting is about complexity ranking of inductive biases : Algorithmic recipe

2022-10-25T15:22:00.001-07:00

Preamble

Figure: Moon patterns
human brain
invents. (Wikipedia)

Detecting overfitting is inherently a comparison problem of the complexity of multiple objects, i.e., models or an algorithm capable of making predictions. A model is overfitted (underfitted) if we only compare it to another model. Model selection involves comparing multiple models with different complexities. The summary of this approach with basic mathematical definitions is given here.

Misconceptions: Poor generalisation is not synonymous with overfitting.

None of these techniques would prevent us from overfitting: Cross-validation, having more data, early stopping, and comparing test-train learning curves are all about generalisation. Their purpose is not to detect overfitting.

We need at least two different models, i.e., two different inductive biases, to judge which model is overfitted. One distinct approach in deep learning, called dropout, prevents overfitting while it alternates between multiple models, i.e., multiple inductive bias. For judgment, dropout implementation has to compare those alternating model test performances during training to judge overfitting.

What is an inductive bias?

There are multiple inceptions of inductive bias. Here, we concentrate on a parametrised model, $\mathscr{M}(\theta)$ on a dataset $\mathscr{D}$, the selection of a model type, or modelling approach, usually manifest as a functional form $\mathscr{M}=f(x)$ or as a function approximation, i.e., for example neural network, are all manifestation of inductive biases. Different parameterisation of model learned on the subsets of the dataset are still the same inductive bias.

Complexity ranking of inductive biases: An Algorithmic recipe

We are sketching out an algorithmic recipe for complexity ranking of inductive biases via informal steps:

Define a complexity measure $\mathscr{C}$($\mathscr{M}$) over an inductive bias.
Define a generalisation measure $\mathscr{G}$($\mathscr{M}$, $\mathscr{D}$) over and inductive bias and dataset.
Select a set of inductive biases, at least-two, $\mathscr{M}_{1}$ and $\mathscr{M}_{2}$.
Produce complexity and generalisation measures on ($\mathscr{M}$, $\mathscr{D}$): Here for two inductive biases: $\mathscr{C}_{1}$, $\mathscr{C}_{2}$, $\mathscr{G}_{1}$, $\mathscr{G}_{2}$.
Ranking of $\mathscr{M}_{1}$ and $\mathscr{M}_{2}$: $argmax \{ \mathscr{G}_{1}, \mathscr{G}_{2}\}$ and $argmin \{ \mathscr{C}_{1}, \mathscr{C}_{2}\}$

The core concept appears as when generalisations are close enough we pick out the inductive bias that is less complex.

Conclusion & Outlook

In practice, probably due to hectic delivery constraints, or mere laziness, we still rely on simple holdout method to build models, only single test and train split, not even learning curves, specially in deep learning models without practicing Occam's razor. A major insight in this direction appears to be that, holdout approach can only help us to detect generalisation, not overfitting. We clarify this via the concept of inductive bias distinguishing that different parametrisation of the same model doesn't change the inductive bias introduced by the modelling choice.

In fact, due to resource constraints of model life-cycle, i.e., energy consumption and cognitive load of introducing a complex model, practicing proper Occam's razor: complexity ranking of inductive biases, is much more important than ever for sustainable environment and human capital.

Further reading

Some of the posts, reverse chronological order, that this blog have tried to convey what overfitting entails and its general implications.

Heavy-matter-wave and ultra-sensitive interferometry: An opportunity for quantum-gravity becoming an evidence based research

2022-10-04T12:10:00.005-07:00

Solar Eclipse of 1919
(wikipedia)

Preamble

Cool ideas in theoretical physics are ofter opaque for general reader whether if they are backed up with any experimental evidence in the real world. The success of LIGO (Laser Interferometer Gravitational-wave Observatory) definitely proven the value of interferometry for advancement of cool ideas of theoretical physics supported by real world measurable evidence. An other type of interferometry that could be used in testing multiple-different ideas from theoretical physics is called matter-wave interferometry or atom interferometry: It's been around decades but the new developments and increased sensitivity with measurement on heavy atomic system-waves will pave the technical capabilities to test multiple ideas of theoretical physics.

Basic mathematical principle of interferometry

Usually interferometry is explained with device and experimental setting details that could be confusing. However, one could explain the very principle without introducing any experimental setup. The basic idea of of interferometry is that if a simple wave, such as $\omega(t)=\sin\Theta(t)$, is first split into two waves and reflected over the same distance, one with shifted with a constant phase, in the vacuum without any interactions. A linear combination of the returned waves $\omega_{1}(t)=\sin \Theta(t)$ and $\omega_{2}(t)=\sin( \Theta(t) + \pi))$, will yield to zero, i.e., an interference pattern generated by $\omega_{1}(t)+\omega_{2}(t)=0$. This very basic principle can be used to detect interactions and characteristics of those interactions wave encounter over the time it travels to reflect and come back. Of course, the basic wave used in many interferometry experiments is the laser light and interaction we measure could be gravitational wave that interacts with the laser light i.e., LIGO's set-up.

Detection of matter-waves : What is heavy and ultra-sensitivity?

Each atomic system exhibits some quantum wave properties, i.e., matter waves. It implies a given molecular system have some wave signatures-characteristics which could be extracted in the experimental setting. Instead of laser light, one could use atomic system that is reflected similar to the basic principle. However, the primary difference is that increasing mass requires orders of magnitude more sensitive wave detectors for atomic interferometers. Currently heavy means usually above ~$10^{9}$ Da (comparing to Helium-4 which is about ~4 Da), these new heavy atomic interferometers might be able to detect gravitational-interactions within quantum-wave level due to precisions achieved ultra-sensitive. This sounds trivial but experimental connection to theories of quantum-gravity, one of the unsolved puzzles in theoretical-physics appears to be a potential break-through. One prominent example in this direction is entropic gravity and wave-function collapse theories.

Conclusion

Recent developments in heavy matter-wave interferometry could be leveraged for testing quantum-gravity arguments and theoretical suggestions. We try to bring this idea into general attention without resorting in describing experimental details.

Further Reading & Notes

Dalton, mass-unit used in matter-wave interferometry.
Atom Interferometry by Prof. Pritchard YouTube.
Newton-Schrödinger equation.

Papers of Kingsley R. W. Jones are also very novel in this direction.

A roadmap for universal high-mass matter- wave interferometry Kilka et. al. AVS Quantum Sci. 4, 020502 (2022). doi

Current capabilities as of 2022, atom interferometers can reach up to ~300 kDa.

Testing Entropic gravity, arXiv.
NASA early stage ideas workshops : web-archive

Building robust AI systems: Is an artificial intelligent agent just a probabilistic boolean function?

2022-09-20T12:36:00.005-07:00

Preamble

George Boole (Wikipedia)

Agent, AI agent or an intelligent agent is used often to describe algorithms or AI systems that are released by research teams recently. However, the definition of an intelligent agent (IA) is a bit opaque. Naïvely thinking, it is nothing more than a decision maker that shows some intelligent behaviour. However, making a decision intelligently is hard to quantify computationally, and probably IA for us is something that can be representable as a Turing machine. Here, we argue that an intelligent agent in the current AI systems should be seen as a function without side effects outputting a boolean output and shouldn't be extrapolated or compare to human level intelligence. Causal inference capabilities should be seen as a scientific guidance to this function decompositions without side-effects, i.e., Human in-the loop Probabilistic Boolean Functions (PBFs).

Computational learning theories are based on binary learners

Two of the major theories of statistical learning PAC and VC dimensions build upon on "binary learning".

PAC stands for Probably Approximately Correct, It sets basic framework and mathematical building blocks for defining a machine learning problem from complexity theory. Probably correct implies finding a weak learning function given binary instance set $X=\{1,0\}^{n}$. The binary set or its subsets mathematically called concepts and under certain mathematical conditions a system said to be PAC learnable. There are equivalences to VC and other computation learning frameworks.

Robust AI systems: Deep reinforcement learning and PAC

Even though the theory of learning on deep (reinforcement) learning is not established and active area of research. There is an intimate connection with composition of concepts, i.e., binary instance subsets as almost all operations within deep RL can be viewed as probabilistic Boolean functions (PBFs).

Conclusion

Current research and practice in robust AI systems could focus on producing learnable probabilistic boolean functions (PBFs) as intelligent agents, rather than being a human level intelligent agents. This modest purpose might bring more practical fruit than long-term aims of replacing human intelligence. Moreover, theory of computation for deep learning and causality could benefit from this approach.

Further reading

Valiant84. Theory of Learnable.
VC Dimension.
Modern Theory and Machine Learning, Chase-Freitag, 2018

Bayesian rabbit holes: Decoding conditional probability with non-commutative algebra

2022-07-05T11:32:00.000-07:00

Preamble

The White Rabbit
(Wikipedia)

A novice analyst or even experienced (data) scientist would have thought that the bar notation $|$ in representing conditional probability carries some different operational mathematics. Primarily when written in explicit distribution functions $p(x|y)$. Similar approach applies to joint probabilities such as $p(x, y)$ too. One could see a mixture of these, such as $p(x, y | z)$. In this short exposition, we clarify that none of these identifications within arguments of probability do have any different resulting operational meaning.

Arguments in probabilities: Boolean statement and filtering

Arguments in any probability are mathematical statements of discrete mathematics that correspond to events in the experimental setting. These are statements declaring some facts with a boolean outcome. These statements are queries to a data set. Such as, if the temperature is above $30$ degrees, $T > 30$. Temperature $T$ is a random variable. Unfortunately, the term random variable is often used differently in many textbooks. It is defined as a mapping rather than as a single variable. The bar $|$ in conditional probability $p(x|y)$, implies statement $x$ given that statement $y$ has already occurred, i.e., if. This interpretation implies that $y$ first occurred before $x$, but it doesn't imply that they are causally linked. The condition plays a role in filtering, a where clause in query languages. $p(x|y)$ boils down to $p_{y}(x)$, where the first statement $y$ is applied to the dataset before computing the probability on the remaining statement $x$.

In the case of joint probabilities $p(x, y)$, events co-occur, i.e., AND statement. In summary, anything in the argument of $p$ is written as a mathematical statement. In the case of assigning a distribution or a functional form to $p$, there is no particular role for conditionals or joints; the modelling approach sets an appropriate structure.

Conditioning does not imply casual direction: do-Calculus do

A filtering interpretation of conditional $p(x|y)$ does not imply causal direction, but $do$ operator does, $p(x|do(y))$.

Non-commutative algebra: When frequentist are equivalent to Bayesian

Most of the simple filtering operations would result in identical results if reversed. $p(x|y) = p(y|x)$, prior being equal to posterior. This remark implies we can't apply Bayesian learning with commutative statements. We need non-commutative statements; as a result, one can do Bayesian learning with the newly arriving data, i.e., the arrival of new subjective evidence. The reason seems to be due to the frequentist nature of filtering.

Outlook

Even though we provided some revelations on decoding the operational meaning of conditional probabilities, we suggested that any conditional, joint or any combination of these within the argument of probabilities has no operational purpose other than pre-processing steps. However, the philosophical and practical implications of probabilistic reasoning are always counterintuitive. Probabilistic reasoning is a complex problem computationally. From a causal inference perspective, we are better equipped to tackle these issues with do-Bayesian analysis.

Further reading

Discrete Mathematics, Oscar Levin
Causal Inference in Statistics, A Primer, Judea Pearl, Madelyn Glymour, Nicholas P. Jewell
Indicative Conditionals Stanford Encyclopedia of Philosophy
Bayesian Epistemology Stanford Encyclopedia of Philosophy

Please Cite as:

@misc{suezen22brh,

title = {Bayesian rabbit holes: Decoding conditional probability with non-commutative algebra},

howpublished = {\url{https://science-memo.blogspot.com/2022/07/bayesian-conditional-noncommutative.html}},

author = {Mehmet Süzen},

year = {2022}

}

Empirical risk minimization is not learning : A mathematical definition of learning and re-understanding of overfitting and Occam's razor in machine learning

2022-06-20T10:01:00.005-07:00

Simionescu Function (Wikipedia)

Preamble

The holy grail of machine learning appears to be the empirical risk minimisation. However, on the contrary to general dogma, the primary objective of machine learning is not risk minimisation per se but mimicking human or animal learning. Empirical risk minimisation is just a snap-shot in this direction and is part of a learning measure, not the primary objective.

Unfortunately, all current major machine learning libraries are implementing empirical risk minimisation as primary objective, so called a training, manifest as usually .fit. Here we provide a mathematical definition of learning in the language of empirical risk minimisation and its implications on two very important concepts, overfitting and Occam's razor.

Our exposition is still informal but it should be readable for experienced practitioners.

Definition: Empirical Risk Minimization

Given set of $k$ observation $\mathscr{O} = \{o_{1}, ..., o_{k} \}$ where $o_{i} \in \mathbb{R}^{n}$, $n$-dimensional vectors. Corresponding labels or binary classes, the set $\mathscr{S} = \{ s_{1}, .., s_{k}\}$, with $s_{i} \in \{0,1\}$ is defined. A function $g$ maps observations to classes $g: \mathscr{O} \to \mathscr{S}$. An error function (or loss) $E$ measures the error made by the estimated map function $\hat{g}$ compare to true map function $g$, $E=E(\hat{g}, g)$. The entire idea of supervised machine learning boils down to minimising a functional called ER (Empirical Risk), here we denoted by $G$, it is a functional, meaning is a function of function, over the domain $\mathscr{D} = Tr(\mathscr{O} x \mathscr{S})$ in discrete form, $$ G[E] = \frac{1}{k} {\Large \Sigma}_{\mathscr{D} } E(\hat{g}, g) $$. This is so called a training a machine learning model, or an estimation for $\hat{g}$. However, testing this estimate on the new data is not the main purpose of the learning.

Definition: Learning measure

A learning measures $M$, on $\hat{g}$ is defined over set of $l$ observations with increasing size, $\Theta = \{ \mathscr{O}_{1}, ..., \mathscr{O}_{l}\}$ whereby size of each set is monotonically higher, meaning that $ | \mathscr{O}_{1}| < | \mathscr{O}_{2}| , ...,< | \mathscr{O}_{l}|$.

Definition: Empirical Risk Minimization with a learning measure (ERL)

Now, we are in a position to reformulate ER with learning measure, we call this ERL. This come with a testing procedure.

If empirical risks $G[E_{j}]$ lowers monotonically, $ G[E_{1}] > G[E_{2}] > ... > G[E_{l}]$, then we said the functional form of $\hat{g}$ is a learning over the set $\Theta$.

Functional form of $\hat{g}$ : Inductive bias

The functional form implies a model selection, and a technical term of this also known as inductive bias with other assumptions, meaning the selection of complexity of the model, for example a linear regression or nonlinear regression.

Re-understanding of overfitting and Occam's razor from ERL perspective

If we have two different ERLs on $\hat{g}^{1}$ and $\hat{g}^{2}$. Then overfitting is a comparison problem between monotonically increasing empirical risks. If model, here an inductive bias or a functional form, over learning measure, we select the one with "higher monotonicity" and the less complex one and call the other overfitted model. Complexity here boils down to functional complexity of $\hat{g}^{1}$ and $\hat{g}^{2}$ and overfitting can only be tested with two models over monotonicity (increasing) of ERLs.

Conclusions

In the age of deep learning systems, the classical learning theory needs an update on how do we define what is learning beyond a single shot fitting exercise. A first step in this direction would be to improve upon basic definitions of Empirical Risk (ER) minimisation that would reflect real-life learning systems similar to forgetting mechanism proposed by Ebbinghaus. This is consistent with Tom Mitchell's definition of operational machine learning. A next level would be to add causality in the definition.

Please cite as follows:

@misc{suezen22erm,

title = { Empirical risk minimization is not learning : A mathematical definition of learning and re-understanding of overfitting and Occam's razor in machine learning},

howpublished = {\url{http://science-memo.blogspot.com/2022/06/empirical-risk-minimisation-learning-curve.html}},

author = {Mehmet Süzen},

year = {2022}

}

Postscript Notes

Following notes are added after initial release

Postscript 1: Understanding overfitting as comparison of inductive biases

ERM could be confusing for even experienced researchers. It is indeed about risk measure. We measure the risk of a model, i.e., machine learning procedure that how much error would it make on the given new data distribution, as in risk of investing. This is quite a similar notion as in financial risk of loss but not explicitly stated.

Moreover, a primary objective of machine learning is not ERM but measure learning curves and pair-wise comparison of inductive biases, avoiding overfitting. An inductive bias, here we restrict the concept as in model type, is a model selection step: different parametrisation of the same model are still the same inductive bias. That’s why standard training-error learning curves can’t be used to detect overfitting alone.

Postscript 2: Learning is not to optimise: Thermodynamic limit, true risk and accessible learning space

True risk minimisation in machine learning is not possible, instead we rely on ERM, i.e., Emprical Risk Minimisation. However, the purpose of machine learning algorithm is not to minimise risk, as we only have a partial knowledge about the reality through data. Learning implies finding out a region in accessible learning space whereby there is a monotonic increase in the objective; ERM is only a single point on this space, the concept rooted in German scientist Hermann Ebbinghaus work on memory.

There is an intimate connection to thermodynamic limit and true risk in this direction as an open research. However, it doesn’t imply infinite limit of data, but the observable’s behaviour. That’s why full empiricist approaches usually requires a complement of a physical laws, such as Physics Informed Neural Networks (PINNs) or Structural Causal Model (SCM).

Postscript 3: Missing abstraction in modern machine learning libraries

Interestingly current modern machine learning libraries stop abstracting further than fitting: .fit and .predict. This is short of learning as in machine learning. Learning manifest itself In learning curves. .learn functionality can be leveraged beyond fitting and if we are learning via monotonically increasing performance. Origin of this lack of tools for .learn appears to be how Empirical Risk Minimisation (ERM) is formulated on a single task.

A misconception in ergodicity: Identify ergodic regime not ergodic process

2022-05-11T11:19:00.002-07:00

Preamble

Figure 1: Two observable's approach to
ergodicity for Bernoulli Trials.

Ergodicity appears in many fields, in physics, chemistry and natural sciences but in economics to machine learning as well. Recall that, ergodicity in physics and mathematical definition diverges significantly due to Birkhoff's statistical definition against Boltzmann's physical approach. Here we will follow Birkhoff's definition of ergodicity which is a statistical one. The basic notion of ergodicity is confusing even among experienced academic circles. The primary misconception is that ergodicity is attributed to a process, a given process being ergodic. We address this by pointing out that ergodicity appears as a regime or a window so to speak for a given process's time-evolution and it can't be attributed to an entire generating process.

No such thing as ergodic process but ergodic regime given observable

A process being ergodic is not entirely true identification. Ergodicity is a regime over a given time window for a given observable derived from the process. This is the basis of ensemble theory from statistical physics. Most of the processes generates initially a non-ergodic regime given an observable. In order to identify an ergodic regime, we need to define for a discrete setting :

the ensemble (sample space) : In discrete dynamics we also have an alphabet that ensemble is composed of.
an observable defined over the sample space.
a process (usually dynamics on the sample space evolving over-time).
a measure and threshold to discriminate ergodic to non-ergodic regimes.

Interesting thing is that different observables on the same ensemble and the process may generate different ergodic regimes.

What are the processes and regime mathematically?

A process is essentially a dynamical system mathematically. this includes stochastic models and as well as deterministic systems sensitive to initial conditions. Prominently these both combined in Statistical Physics. A regime mathematically implies a range of parameters or a time-window that a system behaves very differently.

Identification of ergodic regime

Figure 2: Evolution of
time-averaged OR observable.

The main objective of finding out if dynamics produced by the process on our observable enters or in an ergodic regime is to measure if ensemble-averaged observable is equivalent to time-averaged observable value. Here equivalence is a difficult concept to address quantitatively. The simplest measure would be to check if $\Omega = \langle A \rangle_{ensemble} - \langle A \rangle_{time}$ is close to zero, i.e., vanishing. $ \Omega$ being the ergodicity measure and $A$ is the observable with different averaging procedure. This is the definition we will use here. However, beware that in the physics literature there are more advanced measures to detect ergodicity, such as considering diffusion-like behaviour, meaning that the transition from non-ergodic to ergodic regime is not abrupt but have a diffusing approach to ergodicity.

Figure 3: Evolution of
time-averaged mean.

In some other academic fields approach to ergodic regime has different names not strictly but closely related, such as in chemical physics or molecular dynamics, equilibration time, relaxation time, equilibrium, steady-state for a given observable, in statistics Monte Carlo simulations, it is usually called burn out period. Not always, but in ergodic regime, observable is stationary and time-independent. In Physics, this is much easier to distinguish because time-dependence, equilibrium and stationarity are tied to energy transfer to the system.

Ergodic regime not ergodic process : An example of Bernoulli Trials

Apart from real physical processes such as Ising Model, a basic process we can use to understand how ergodic regime could be detected using Bernoulli Trials.

Here for a Bernoulli trials/process, we will use random number generators for a binary outcome, i.e., RNG Marsenne-Twister to generate time evolution of an observables on two sites: Let's say we have two sites $x, y \in \{1, 0\}$. The ensemble of this two site system $xy$ is simply the sample space of all possible outcomes $S=\{10, 11, 01, 00\}$. Time evolution of such two site system is formulated here as choosing $\{0,1\}$ for a given site at a given time, see Appendix Python notebook.

Now, the most important part of checking ergodic regime is that we need to define an observable over two side trials. We denote two observable as $O_{1}$, which is an OR operation between sites, and $O_{2}$ is averaged over two sites. Since our sample space is small, we can compute the ensemble average observables analytically:

$O_{1} = (x+y)/2$ then $10, 11, 01, 00 ; (1/2 + 2/2 + 1/2 + 0 ) /4 = 0.5$
$O_{2} = x OR y$ then $10, 11, 01, 00 ; ( 1 + 1 + 1 + 0 )/4 = 0.75$

We can compute the time-averaged observables over time via simulations, but their formulation are know as follows:

Time average for $O_{1}$ at time $t$ (current step) is $ \frac{1}{t} \sum_{i=0}^{t} (x_{i}+y_{i})/2.0$
Time average for $O_{2}$ at time $t$ (current step) is $ \frac{1}{t} \sum_{i=0}^{t} (x_{i} OR y_{i})$.

One of the possible trajectories are shown in Figure 2 and 3. For approach to ergodicity measure, we shown this at Figure 1. Even though, we should run multiple trajectories to have error estimates, we can clearly see that ergodicity regime starts after 10K steps, at least. Moreover, different observables have different decay rates to ergodic regime. From preliminary simulation, it appears to be OR observable converges slower, though this is a single trajectory.

Conclusion

We have shown that manifestation of the ergodic regime depends on the time-evolution of the observable given a measure of ergodicity, i.e., a condition how ergodicity is detected. This exposition should clarify that a generating process does not get an attribute of "ergodic process" rather we talk about "ergodic regime" depending on observable and the process over temporal evolution. Interestingly, from Physics point of view, it is perfectly possible that an observable attains ergodic regime and then falls back to non-ergodic regime.

Further reading

Practical Understanding of Ergodicity : Elementary ergodicity and some basic references.
Is ergodicity a reasonable hypothesis? : Boltzmann's definition of ergodicity.
Scaling of ergodicity in binary systems : An idea on extending Bernoulli trials ergodicity to N-dimensions (sites).
Effective ergodicity in single-spin-flip dynamics PRE Approach to ergodicity in magnetic systems, extents to neural networks.
IsingLenzMC R package : Effective ergodicity convergence R utilities and Ising-Lenz 1-D Monte Carlo.
Diffusive behaviour of ergodicity convergence in Ising-Model.

Appendix: Code

Bernoulli Trial example we discussed is available as a Python notebook on github here.

Please cite as follows:

@misc{suezen22ergoreg,

title = {A misconception in ergodicity: Identify ergodic regime not ergodic process},

howpublished = {\url{http://science-memo.blogspot.com/2022/05/ergodic-regime-not-process.html},

author = {Mehmet Süzen},

year = {2022}

}

Physics origins of the most important statistical ideas of recent times

2022-02-11T10:48:00.008-08:00

Figure: Maxwell's handwritings,
state diagram (Wikipedia)

Preamble

The modern statistics now move into an emerging field called data science that amalgamate many different fields from high performance computing to control engineering. However, the emergent behaviour from researchers in machine learning and statistics that, sometimes they omit naïvely and probably unknowingly the fact that some of the most important ideas in data sciences are actually originated from Physics discoveries and specifically developed by physicist. In this short exposition we try to review these physics origins on the areas defined by Gelman and Vehtari (doi). Additional section is also added in other possible areas that are currently the focus of active research in data sciences.

Bootstrapping and simulation based inference : Gibbs's Ensemble theory and Metropolis's simulations

Bootstrapping is a novel idea of estimations with uncertainty with given set of samples. It is mostly popularised by Efron and his contribution is immense, making this tool available to all researchers doing quantitative analysis. However, the origins of bootstrapping can be traced back to the idea of ensembles in statistical physics, which is introduced by J. Gibbs. The ensembles in physics allow us to do just what bootstrapping helps, estimating a quantity of interest with sub-sampling, in the case of statistical physics this appears as sampling a set of different microstates. Using this idea Metropolis devised a inference in 1953, to compute ensemble averages for liquids using computers. Note that, usage of Monte Carlo approach for pure mathematical nature, i.e., solving integrals, appear much earlier with von Neumann's efforts.

Causality : Hamiltonian systems to Thermodynamic potentials

Figure: Maxwell
Relations as causal
diagrams.

Even though the historical roots of causal analysis in early 20th century attributed to Wright 1923 for his definition of path analysis, causality was the core tanents of Newtonian mechanics in distinguishing left and right of the equations of motions in the form of differential equations, and the set of differential equations following that with Hamiltonian Mechanics is actually forms a graph, i.e., relationships between generalised coordinates, momentum and positions. This connection is never acknowledge in early statistical literature, and probably causal constructions from classical physics were not well known in that community or did not find its way to data-driven mechanics. Similarly, causal construction of thermodynamic potentials appear as a directed graph as in, Born wheel. It appears as a mnemonic but it is actually causally constructed via Legendre Transformations. Of course, causality, philosophically speaking, is discussed since Ancient Greece but here we restrict the discussion on solely quantitative theories after Newton.

Overparametrised models and regularisation : Poincaré classifications and astrophysical dynamics

The current deep learning systems classified as massively overparametrized systems. However, the lower dimensional understanding of this phenomenon were well studied by Poincare's classification of classical dynamics, namely the measurement problem of having overdetermined system of differential equations, i.e., whereby inverse problems are well known in astrophysics and theoretical mechanics.

High-performance computing: Big-data to GPUs

Similarly, using supercomputers or as now we call it high-performance computation with big data generating processes were actually can be traced back to Manhattan project and ENIAC that aims solving scattering equations and almost 50 years of development on this direction before 2000s.

Conclusion

The impressive development of new emergent field of data science as a larger perspective of statistics into computer science have strong origins from core Physics literature and research. These connections are not sufficiently cited or acknowledged. Our aim in this short exposition is to bring these aspects into the attention of data science practitioners and researchers alike.

Further reading

Some of the mentioned works and related reading list, papers or books.

Please cite as follows:

@misc{suezen22pom,

title = { Physics origins of the most important statistical ideas of recent times },

howpublished = {\url{http://science-memo.blogspot.com/2022/02/physics-origins-of-most-important.html},

author = {Mehmet Süzen},

year = {2022}

}

Appendix: Pearson correlation and Lattices

Auguste Bravais is famous for his contribution in foundational work on the mathematical theory for crystallography, now seems to be going far beyond periodic solids. Unknown to many, he actually first driven the expression for what we know today as correlation coefficient or Pearson’s correlation or less commonly Pearson-Galton coefficient. Interestingly, one of the grandfathers of causal analysis Wright is mentioned this in his seminal work of 1921 titled “Correlation and causation” acknowledged Bravais for his 1849 work as the first derivation of correlation.

Appendix: Partition function and set theoretic probability

Long before Kolmogorov set forward his formal foundations of probabilities, Boltzmann, Maxwell and Gibbs build theories of statistical mechanics using probabilistic language and even define settings for set theoretic foundations by introducing ensembles for thermodynamics. For example, partition function (Z) appeared as defining a normalisation factor that summation of densities should yield to 1. Apparently Kolmogorov and contemporaries inspired a lot from physics and mechanics literature.

Appendix: Generative AI

Of course now generative AI took over the hype. Indeed physics of diffusion from Fokker-Planck equation to basic Langevin dynamics is leveraged.

Appendix: Physics is fundamental for the advancement of AI research and practice

AI as a phenomena appears to be in the domain of core physics. For this reason, studying physics as a (post)-degree or as a self-study modules will give students and practitioners alike a definitive cutting-edge insights.

Statistical models based on correlations originates from physics of periodic solids and astrophysical n-body dynamics.
Neural networks originates from the modelling magnetic materials in discrete states and later named as cooperative phenomenon. Their training dynamics closely follows free-energy minimisation.
Causality roots in ensemble theory of physical entropy.
Almost all sampling based techniques are based on the idea of sampling physics of energy surfaces, i.e. Potential Energy Surfaces. (PES).
Generative AI originates from physics of diffusion of fluids: classical Liouville description of the classical mechanics, i.e, phase-space flows and generalised Fokker-Planck dynamics.
Language models based on attention are actually coarse-grained entropy-dynamics
introduced by Gibbs: ‘Attention Layers’ behaves as coarse-graining procedure, i.e, compressed
causal graphs mapping.

This is not about building analogies to physics but as foundational topics to AI.

Periodic Spectral Ergodicity Accurately Predicts Deep Learning Generalisation

2021-11-15T11:46:00.006-08:00

Preamble

Dali (1931),
The Persistence of Memory (Wikipedia)

One of the new mathematical concepts arise due to understanding of deep learning is called periodic spectral ergodicity (PSE). The cascading PSE (cPSE) propagates over deep learning layers which can also be used as a complexity measure. cPSE actually can also predict the generalisation ability. In this post, we review this interesting finding in an easy and short manner.

How periodic spectral ergodicity cascades over layers

We have reviewed spectral ergodicity in a gentle fashion earlier, here. Only difference is that in real deep learning architectures, length of the eigenvalue spectrum, i.e., the number of bins in the histogram, generated by weight matrices are not equal in size. To align them, we use something called periodic boundary conditions or turn the eigenvalues in a cyclic fashion, up to the maximum length spectra we have seen up to that layer. Here are the steps that give, the intuition of how to compute cascading periodic spectral ergodicity (cPSE).

1. We compute eigenvalue spectrum up to a layer $i$ and align the smaller spectrum with periodic boundary conditions, i.e., cyclic.

2. Compute spectral ergodicity at layers $i$ and $i-1$.

3. Compute the cascading PSE at layer $i$ simply with a distance metric $\Omega^{i}$ and $\Omega^{i-1}$. i.e., KL divergence in two directions, recall earlier tutorials.

If we repeat this up to the last layer, cPSE measures the complexity of the deep learning architecture, both capturing structural and learning algorithm-wise, in a depth of a layer fashion.

Generalisation Gap and cPSE

Apart from being a complexity measure, cPSE predicts the generalisation gap given reference architecture i.e., it correlates with the performance almost perfectly. These findings are presented in the paper suzen2019 .

Conclusions and Outlook

The complexity of deep learning architectures are still an open research problem. One of the most promising direction is to use cPSE in terms of capturing structural complexity as well. While other measures in the literature did not consider depth dependency, whereby cPSE appears to be the first one.

Reference

@article{suzen2019,
  title={Periodic Spectral Ergodicity: A Complexity Measure for Deep Neural Networks and Neural Architecture Search},
  author={S{\"u}zen, Mehmet and Cerd{\`a}, Joan J and Weber, Cornelius},
  journal={arXiv preprint arXiv:1911.07831},
  year={2019}
}

Cite this post as Periodic Spectral Ergodicity Accurately Predicts Deep Learning Generalisation, Mehmet Süzen, https://science-memo.blogspot.com/2021/11/periodic-spectral-ergodicity-predicts-generalisation-deep-learning.html 2021

Appendix

Bristol v0.12.2 is now supporting in computing cPSE from list of matrices

from bristol import cPSE

import numpy as np

np.random.seed(42)

matrices = [np.random.normal(size=(64,64)) for _ in range(10)]

(d_layers, cpse) = cPSE.cpse_measure_vanilla(matrices)

Deep Learning in Mind a Gentle Introduction to Spectral Ergodicity

2021-07-28T08:46:00.007-07:00

Preamble

Figure: Monalisa on
Eigenvector grids (Wikipedia)

In the post, A New Matrix Mathematics for Deep Learning : Random Matrix Theory of Deep Learning, we have outlined a new mathematical concepts that are aimed at deep learning but in general belonging to applied mathematics. Here, we dive into one of the concepts, spectral ergodicity. We aimed at conveying what does it mean and how to compute spectral ergodicity for a set of matrices, i.e., ensemble. We will use a visual aid and verbal descriptions of steps to produce a quantitative measure of spectral ergodicity.

The idea of spectral ergodicity comes from quantum statistical physics but it is recently revived for deep learning as a new concept in order to accommodate mathematical needs of explaining and understanding the complexity of deep learning architectures.

Understanding Spectral Ergodicity

The concept of ergodicity can get quiet mathematical even for a professional mathematician. A practical understanding of ergodicity could lead to the law of large numbers statistically speaking. However, observed ergodicity for ensemble of matrices, i.e. over their eigenvalue spectrum, are not formally defined before in the literature, and only appeared in statistical quantum mechanics in a specialised case. Here we do a formal definition gently.

The spectral ergodicity of snapshot of values from $M$ matrices, where they are $N \times N$ sizes, denoted by $\Omega$, can be produce with the following steps:

Compute eigenvalues of $M$ matrices separately.
Produce equidistance spectra of matrices out of eigenvalues, i.e., histograms with $b_{k}$ bins. Each cell in the Figure corresponds to bin in the spectra of the matrices.
Compute average values over each bin across $M$ matrices.
Computing root mean square deviation that went to each bin from $M$ matrices from corresponding ensemble averaged value and average over $M$ and $N$. This will give a distribution, $\Omega=\Omega(b_{k})$, which represents spectral ergodicity value, think as a snapshot value of a dynamical process.

Attentive reader would notice that normally, measures of ergodicity leads to a single value, such as in spin-glasses, but here we obtain ergodicity as a measure distribution. This stems from the fact that our observable is not univariate but it is a multivariate measure over spectra of the matrix, i.e., bins in the histogram of eigenvalues.

Why spectral ergodicity important for deep learning?

The reason why this measure is so important lies in dynamics and consistency in measuring observables (no nothing to do with quantum mechanics but time and ensemble averages classically). Normally we can't measure ensemble averages. In experimental conditions the measurement we do is usually a time averaged value. This is exactly what happens when we train deep neural network, i.e, ergodicity of weight matrices. Essentially, spectral ergodicity would capture deep neural network's characteristics.

Outlook

The way we express spectral ergodicity here would only consider all layer having the same size. One would need a more advanced computation of spectral ergodicity for more realistic architectures, which is called cascading Periodic Spectral Ergodicity measure suitable as a complexity measure for deep learning. The computation of such measure is more involved and spectral ergodicity we cover here is the first step.

Cite this post with Deep Learning in Mind Very Gentle Introduction to Spectral Ergodicity, Mehmet Süzen, (2021) https://science-memo.blogspot.com/2021/07/deep-learning-random-matrix-theory-spectral-ergodicity.html

A New Matrix Mathematics for Deep Learning : Random Matrix Theory of Deep Learning

2021-07-21T09:32:00.004-07:00

Preamble

Figure: Definition of Randomness
(Compagner 1991, Delft University)

Development of deep learning systems (DLs) increased our hopes to develop more autonomous systems. Based on the hierarchal learning of representations, deep learning defies the basic learning theory that beg the question of still rethinking generalisation. Even though DLs lacks severely the ability to reason without causal inference, they can't do that in vanilla form. However despite this limitation, they provide very rich new mathematical concepts as introduced recently. Here, we review couple of these new concepts briefly and draw attention to Random Matrix Theory's relevance in DLs and its applications in Brain networks. These concepts in isolation are subject of applied mathematics but their interpretation and usage in deep learning architectures are demonstrated recently. In this post we provide a glossary of new concepts, that are not only theoretically interesting, they are directly practical from measuring architecture complexity to equivalance.

Random matrices can simulate deep learning architectures with spectral ergodicity

Random Matrix Theory (RMT) has origins in foundation of mathematical statistics and mathematical physics pioneered by Wishart Distribution and Dyson Circular Ensembles. As primary ingredient of a deep learning model as a result are set of weights, or learned parameter set, manifests as matrices and they come from a learning dynamics that are used in so called in inference time. Natural consequence of this, learning these matrices can be simulated via Random matrices of spectral radius close to unity. This provides us the following, ability to make a generic statement about deep learning systems independent of

Network architecture (topology).
Learning algorithm.
Data sizes and type.
Training procedure.

Why not Hessian or loss-landscape but Weight matrices?

There are studies taking Hessian matrix as a major object, i.e., second derivative of parameters as a function of loss of the network and associate this to random matrices. However, this approach would only covers learning algorithm properties rather than architectures inference or learning capacity. For this reason, weight matrices should be taken as a primary object in any studies of random matrix theory in deep learning as they encode depth in deep learning. Similarly, loss-landscape can not capture the capacity of deep learning.

Conclusion and outlook

In this short exposition, we tried to stimulate readers interest in exciting set of tools from RMTs for deep learning theory and practice. That is still subject of recent research with direct practical relevance. We provided glossary and reading list as well.

Further Reading

Papers introducing new mathematical concepts in deep learning are listed here, they come with associated Python codes for reproducing the concepts.

Earlier relevant blog posts

Citing this post

A New Matrix Mathematics of Deep Learning: Random Matrix Theory of Deep Learning : https://science-memo.blogspot.com/2021/07/random-matrix-theory-deep-learning.html Mehmet Süzen, 2021

Glossary of New Mathematical Concepts of Deep Learning

Summary of the definition of new mathematical concepts for new matrix mathematics.

Spectral Ergodicity Measure of ergodicity in spectra of a given random matrix ensemble sizes. Given set of matrices of equal size that are coming from the same ensemble, average deviation of spectral densities of individual eigenvalues over ensemble averaged eigenvalue. This mimic standard ergodicity, instead of over states of the observable, it measures ergodicity over eigenvalue densities. $\Omega_{k}^{N}$, $k$-th eigenvalue and matrix size of $N$.

Spectral Ergodicity Distance A symmetric distance constructed with two Kullback-Leibler distances over two different size matrix ensembles, in two different direction. $D = KL(N_{a}|N_{b})+ KL(N_{b}|N_{a})$

Mixed Random Matrix Ensemble (MME) Set of matrices constructed from a random ensemble but with difference matrix sizes from N to 2, sizes determined randomly with a coefficient of mixture.

Periodic Spectral Ergodicity (PSE) A measure of Spectral ergodicity for MMEs whereby smaller matrix spectrum placed in periodic boundary conditions, i.e., cyclic list of eigenvalues, simply repeating them up to N eigenvalues.

Layer Matrices Set of learned weight matrices up to a layer in deep learning architecture. Convolutional layers mapped into a matrix, i.e. stacked up.

Cascading Periodic Spectral Ergodicity (cPSE) Measuring PSE over feedforward manner in a deep neural network. Ensemble size is taken up-to that layer matrices.

Circular Spectral Deviation (CSD) This is a measure of fluctuations in spectral density between two ensembles.

Matrix Ensemble Equivalence If CSDs are vanishing for conjugate MMEs, they are said to be equivalent.

Appendix: Practical Python Example

Complexity measure for deep architectures and random matrix ensembles: cPSE.cpse_measure_vanilla Python package Bristol (>= v0.2.12) has now a support for computing cPSE from a list of matrices, no need to put things in torch model format by default.

!pip install bristol==0.2.12

An example case:

from bristol import cPSE

import numpy as np

np.random.seed(42)

matrices = [np.random.normal(size=(64,64)) for _ in range(10)]

(d_layers, cpse) = cPSE.cpse_measure_vanilla(matrices)

d_layers is decreasing vector, it will saturate at some point, that point is where adding more

layers won’t improve the performance. This is data, learning or architecture independent measure.

Only a French word can explain the excitement here: Voilà!

On the fallacy of replacing physical laws with machine-learned inference systems

2021-04-23T12:25:00.007-07:00

Preamble

Progress in machine learning, specifically so-called deep learning, last decade was astonishingly successful in many areas from computer vision to natural language translation reaching automation close to human-level performance in narrow areas, so-called narrow artificial intelligence. At the same time, the scientific and academic communities also joined in applying deep learning in physics and in general physical sciences. If this is used as an assistance to known techniques, it is really good progress, such as drug discovery, accelerating molecular simulations or astrophysical discoveries to understand the universe. However, unfortunately, it is now almost standard claim that one supposedly could replace physical laws with deep learning models: we criticise these claims in general without naming any of our colleagues or works.

Circular reasoning: Usage of data produced by known physics

Blind monks examining an elephant
(Wikipedia)

The primary fallacy on papers claiming to be able to produce a learning system that can actually produce physical laws or replace physics with a deep learning system lies in how these systems are trained. Regardless of how good they are in predictions, their primary ability is the product of already known laws. They would only replicate the laws provided within datasets that are generated by physical laws.

Faulty generalisation: Computational acceleration in narrow application to replacing laws

One of the major faults in concluding that a machine-learned inference system doing better than the physical law is the faulty generalisation of computational acceleration in narrow application areas. This computational acceleration can not be generalised to all parameter space while systems are usually trained in certain restricted parameter space that physical laws generated data, for example solving N-body problems, or dynamics in any scale from action or Lagrangian and generating fundamental particle physics Lagrangians.

Benefits: Causality still requires scientist

The intention of this short article here aimed at showing limitations of using machine-learned inference systems in discovering scientific laws: there are of course benefits of leveraging machine learning and data science techniques in physical sciences, especially accelerating simulations in narrow specialised areas, automating tasks and assisting scientist in cumbersome validations, such as searching and translating in two domains, especially in medicine and astrophysics, for example sorting images of galaxy formations. However, the results would still need a skilled physicist or scientist to really understand and form a judgment for a scientific law or discovery, i.e., establishing causality.

Conclusion : No automated physicist or automated scientific discovery

Artificial general intelligence is not founded yet and has not been achieved. It is for the benefit of physical sciences that researchers do not claim that they found a deep learning system that can replace physical laws in supervised or semi-supervised settings rather concentrate on applications that benefit both theoretical and applied advancement in down to earth fashion. Similarly, funding agencies should be more reasonable and avoid funding such claims.

In summary, if datasets are produced by known physical laws or mathematical principles, the new deep learning system only replicates what was already known and it is not new knowledge, regardless of how these systems can predict or behave with new predictions. Caution is advised. We can not yet replace physicists with machine-learned inference systems, actually, not even radiologists are replaced, despite the impressive advancement in computer vision that produces super-human results.

@misc{suezen21fallacy,

title = {On the fallacy of replacing physical laws with machine-learned inference systems},

howpublished = {\url{http://science-memo.blogspot.com/2021/04/on-fallacy-of-replacing-physical-laws.html}},

author = {Mehmet Süzen},

year = {2021}

}

Postscripts

The following interpretations, reformulations are curated after initial post.

Postscript 1: Regarding Symbolic regression

There are now multiple claims that one could replace physics with symbolic regression. Yes, symbolic regression is quite a powerful method. However, using raw data produced by physical laws, so called simulation data from classical mechanics or modelling experimental data guided by functional forms provided by physics do not imply that one could replace physics or physical laws with machine learned system. We have not achieved Artificial General Intelligence (AGI) and symbolic regression is not AGI. Symbolic regression may not be even useful beyond verification tool for theory and numerical solutions of physical laws.

Postscript 2: Fallacy on the dimensionality reduction and distillation of physical laws with machine learning

There are now multiple claims that one could distill physical dynamical laws with dimensionality reduction. This is indeed a novel approach. However, the core dataset is generated by the coupled set of dynamical equations that is suppose to be reduced with fixed set of initial conditions. This does not imply any kind of distillation of set of original laws, i.e., the procedure can not be qualified as distilling set of equations to less number of equations or variates. It only provides an accelerated deployment of dynamical solvers under very specific conditions. This includes any renormalisation group dynamics.

Postscript 3: A new terms, Scientific Machine Learning Fallacy and s-PINNs.

Usage of symbolic regression with deep learning should be called symbolic physics informed neural networks (s-PINNs. Calling these approaches “machine scientist”, “automated scientist”, “physics laws generator” are technically a fallacy, i.e., Scientific Machine Learning Fallacy, primarily caught up in circular reasoning.

Postscript 4: AutoML is a misnomer : Scientific Machine Learning (SciML) Fallacy

The primary fallacy on papers claiming to be able to produce a learning system that can actually produce physical/scientific laws or replace physics/science with a deep learning system lies in how these systems are trained. AutoML in this context actually doesn’t replace scientist but abstract out former workflows into different meta scientific work assisting scientists: hence a misnomer, MetaML is probably more suited terminology.

Postscript 5: Scientific Machine Learning (SciML) and AI for Science (AI4S): Accelerated meta discovery

SciML is immensely promising in providing accelerated deployment of known scientific workflows: specialised areas such as trajectory learning, novel operator solvers, astrophysical image and cosmological analysis, molecular dynamics and computational applied mathematics in general. Unfortunately, some recent papers or young scientists in their experiences had an impression that SciML is an automated scientific discovery tool that is a replacement of known physical laws with supervised learning systems, including new LLM systems. The primary fallacy here the resulting learning system can only produce physical/scientific laws guided by scientists: designing new PINNs and training methodologies for increasing speed of parameter exploration.

AutoML or SciML in this context actually doesn’t replace scientists but abstract out former workflows into different meta scientific work assisting scientists. Accelerated meta discovery is probably more suited terminology to avoid such early misunderstandings.

Shifting Modern Data Science Forward: Dijkstra principle for data science

2021-04-01T10:40:00.007-07:00

Kindly reposted to KDnuggets by Gregory Piatetsky-Shapiro with the title Data science is not about data -applying Dijkstra principle to data science and enhancements.

Prelude

Dijkstra in Zurich, 1984 (Wikipedia)

Edsger Dijkstra was a Dutch theoretical physicist turned computer scientist, and probably one of the most influential earlier pioneers in the field. He had deep insight in what is computer science and well founded notion of how should it be taught in academics. In this post we extrapolate his ideas into data science. We developed something called, Dijkstra principle for data science, that is driven by his ideas on what does computer science entails.

Computer Science and Astronomy

Astronomy is not about telescopes. Indeed, it is about how universe works and how its constituent parts are interacting. Telescopes, either being optical or radio observations or similar detection techniques are merely tools to practice and do investigation for astronomy. A formed analogy goes into computer science as well, this is the quote from Dijkstra:

Computer science is no more about computers than astronomy is about telescopes. - Edsger Dijkstra

The idea of Computer Science being not about computer is rather strange in the first instance. However, what Dijkstra had in mind is abstract mechanism and mathematical constructs that one can map real problems and solve it as a computer science problem, such as graph algorithms. Though Computer Science had a lot of subfields but its inception can be considered as rooted in applied mathematics.

Dijkstra principle for data science

By using Dijkstra's approach now we are in position to formulate a principle for data science.

Data science is no more about data than computer science is about computers. -Dijkstra principle for data science

This sounds absurd. If data science is not about data, then what is it about? Apart from definition of data science as an emergent field, as an amalgamation of multiple fields from statistics to high performance computing, the idea that data not being the core tenant of data science implies the practice does not aim at data itself rather a higher purpose. Data is used similar to a telescope in astronomy, the purpose is to reveal the empirical truths about representations data conveys. There is no unique ways to achieve this purpose.

Conclusive Remarks

Dijkstra principle for data science would be very helpful in understanding the data science practice as not data-centric, contrary to mainstream dogma, rather as a science-centric practice with the data being the primary tool to leverage, using multitude of techniques. Implication is that machine learning is a secondary tool on top of data in practicing data science. This attitude would help causality playing a major role shifting modern data science forward.

Computable function analogs of natural learning and intelligence may not exist

2021-03-20T14:34:00.008-07:00

Optimal learning : Meta-optimization

Many papers directly equate “machine” learning problem, algorithmic learning oppose to human or animal learning, with optimisation problem. Unfortunately, contrary to common belief machine learning is not an optimisation problem. For example, take optimal learning strategy, a replace learning with optimisation and we end up having and absurd terms of optimal optimisation strategy at one point.

Turing machine (Wikipedia)

Sound like practiced machine learning is a meta-optimisation problem, rather than a learning as humans do.

Computable functions to learning

Fundamentally, we do not know how human learning can be mapped into an algorithm or if there are computable function analogs of human learning or if human intelligence and its artificial analog can be represented as Turing computable manner.

Scientific Memo

Why the simplest explanation is always the best

Compressive algorithmic randomness: Gibbs-randomness proposition for massively energy efficient deep learning

Figure: Dual Tomographic CompressionPerformance, Süzen, 2025.Preamble

A strange tale of path from applied research to fundamental proposition.

New concepts in compression and randomness via train-compress cycles

Conclusion

Further reading

Ultimate physical limit of data storage: Connecting Bekenstein Bound to Landauer Principle

Preamble Bekenstein's Information (Wikipedia)

What is the Bekenstein Bound?

What is Landauer Principle?

Physical limit of data storage: 1 BekensteinBytes

Outlook

Basic understanding of a metric tensor: Disassemble the concept of a distance over Riemannian Manifolds

Mathematical Definition of Heuristic Causal Inference: What differentiates DAGs and do-calculus?

Resolution of misconception of overfitting: Differentiating learning curves from Occam curves

Loschimidt's Paradox and Causality: Can we establish Pearlian expression for Boltzmann's H-theorem?

Insights into Bekenstein entropy with an intuitive mathematical definitions: A look into Thermodynamics of Black-holes

Misconceptions on non-temporal learning: When do machine learning models qualify as prediction systems?

The concept of overgeneralisation and goodness of rank : Overfitting is not about comparing training and test learning curves

The conditional query fallacy: Applying Bayesian inference from discrete mathematics perspective

Preamble

Statement mappings as definition of probability

Conditional Query Fallacy

Outlook

Glossary of concepts

Notes and further reading

Differentiating ensembles and sample spaces: Alignment between statistical mechanics and probability theory

Overfitting is about complexity ranking of inductive biases : Algorithmic recipe

Heavy-matter-wave and ultra-sensitive interferometry: An opportunity for quantum-gravity becoming an evidence based research

Building robust AI systems: Is an artificial intelligent agent just a probabilistic boolean function?

Bayesian rabbit holes: Decoding conditional probability with non-commutative algebra

Empirical risk minimization is not learning : A mathematical definition of learning and re-understanding of overfitting and Occam's razor in machine learning

A misconception in ergodicity: Identify ergodic regime not ergodic process

Physics origins of the most important statistical ideas of recent times

Statistical mechanics of ensemble learning, Anders Krogh and Peter Sollich (1997)

Periodic Spectral Ergodicity Accurately Predicts Deep Learning Generalisation

Deep Learning in Mind a Gentle Introduction to Spectral Ergodicity

A New Matrix Mathematics for Deep Learning : Random Matrix Theory of Deep Learning

On the fallacy of replacing physical laws with machine-learned inference systems

Shifting Modern Data Science Forward: Dijkstra principle for data science

Kindly reposted to KDnuggets by Gregory Piatetsky-Shapiro with the title Data science is not about data -applying Dijkstra principle to data science and enhancements.

Computer Science and Astronomy

Computable function analogs of natural learning and intelligence may not exist

Figure: Dual Tomographic Compression
Performance, Süzen, 2025.
Preamble

Preamble
Bekenstein's Information
(Wikipedia)