Tuesday, 3 June 2025

Compressive algorithmic randomness:
Gibbs-randomness proposition for massively energy efficient deep learning

Figure: Dual Tomographic Compression Performance, Süzen, 2025.
Preamble

Randomness is elusive, and it is probably one of the outstanding concepts of human scientific endeavour, along with gravity. Kolmogorov complexity appears to be a genuinely novel attempt at answering the question "what is randomness?". The idea that the length of the smallest model that can generate a random sequence determines its complexity was a turning point in the history of science. Similarly, it implies choosing the simplest model to explain a phenomenon. That is why Kolmogorov's work was also supported by the ideas of Solomonoff and Chaitin. A recent work explores this algorithmic information from a compression perspective, using Gibbs entropy.

A strange tale: the path from applied research to a fundamental proposition.

While developing a model compression algorithm, I noticed an intriguing behaviour: information, entropy, and compression interact over the course of the compression process in a deeper way than expected.

New concepts in compression and randomness via train-compress cycles

Here we explain the new concepts, covering both deep learning model compression and the interplay between compression and algorithmic randomness.

Inverse compressed sensing (iCS): Normally the CS procedure is applied to reconstruct an unknown signal from fewer measurements. In the deep learning train-compress setting, the weights are known at a given point of the training cycle. If we create hypothetical measurements, then, using CS formulations, we can reconstruct a sparse projection of the weights; a sketch is given below.
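To make the iCS idea concrete, here is a minimal sketch, assuming a random Gaussian sensing matrix and a plain ISTA ($\ell_1$) solver; the dimensions, regularisation strength, and iteration count are illustrative choices, not necessarily those used in the paper.

```python
import numpy as np

# Minimal sketch of inverse compressed sensing (iCS): the weights w are KNOWN;
# we synthesise hypothetical measurements y = A w with a random sensing matrix
# A (m < n), then recover a sparse surrogate of the weights by l1-regularised
# reconstruction (plain ISTA). All sizes and parameters are illustrative.
rng = np.random.default_rng(0)
n, m = 256, 64                                 # weight dimension, number of hypothetical measurements
w = rng.standard_normal(n)                     # known layer weights at some point of the training cycle
A = rng.standard_normal((m, n)) / np.sqrt(m)   # random Gaussian sensing matrix
y = A @ w                                      # hypothetical measurements

# ISTA: iterative soft thresholding for  min_x 0.5*||A x - y||^2 + lam*||x||_1
lam = 0.05
step = 1.0 / np.linalg.norm(A, 2) ** 2         # 1 / Lipschitz constant of the gradient
x = np.zeros(n)
for _ in range(500):
    z = x - step * (A.T @ (A @ x - y))         # gradient step on the data-fit term
    x = np.sign(z) * np.maximum(np.abs(z) - step * lam, 0.0)   # soft-threshold (l1 proximal step)

print(f"{np.mean(np.abs(x) < 1e-6):.0%} of reconstructed weight entries are (near) zero")
```

In a train-compress cycle, one would presumably substitute such sparse reconstructions back into the network, layer by layer, before training continues.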

Dual Tomographic Compression (DTC): Applying iCS to the input and output sides at the neuronal level, layer-wise and simultaneously.

Weight rays: The reconstructed output vectors of DTC, i.e., the weights at a given sparsity level; they are not generated in isolation but within the train-compress cycle.

Gibbs-randomness proposition: An extension of Kolmogorov complexity to a compression process. It states that directed randomness is the same as complexity reduction, i.e., compression.

Conclusion 

A new technique, DTC, can be used to train deep learning models with compression on the fly. This gives rise to massively energy-efficient deep learning, reaching a reduction of roughly 98% in energy use. Moreover, the technique also demonstrates an extended version of Kolmogorov complexity, the Gibbs-randomness proposition.

Further reading 

Paper and code are released:


Cite as 

 @misc{suzen25car, 
     title = {Compressive algorithmic randomness: Gibbs-randomness proposition for massively energy efficient deep learning}, 
     howpublished = {\url{https://science-memo.blogspot.com/2025/06/compressive-algorithmic-randomness.html}}, 
     author = {Mehmet Süzen},
     year = {2025}
}  

Tuesday, 22 April 2025

Ultimate physical limit of data storage:
Connecting Bekenstein Bound to Landauer Principle

Preamble 
Figure: Bekenstein's Information (Wikipedia)

Black holes are not so esoteric anymore after LIGO's success and the successful imaging efforts. Their entropy behaves very differently from the entropy of ordinary matter. This led to the incredible discovery of the so-called holographic principle, which states that we live on a projection of a higher-dimensional manifestation of the universe. On the other hand, Landauer quietly made a discovery about the energy cost of erasing information, i.e., of irreversible information processing. A similar bound put forward by Bekenstein for black holes makes the Landauer-Bekenstein bound a natural avenue of study for quantum gravity.

What is the Bekenstein Bound?

Basically, this bound limits the entropy of a black hole: the entropy $S$ is bounded by the radius of the black hole $R$ and its energy $E$, with all other constants set to 1,

$$S \le R \cdot E$$

What is Landauer Principle?

This is a limit stating that any process that deletes 1 bit of information has to dissipate an energy proportional to its temperature,

$$ E \ge T \ln 2$$

Again, we have set the constants to 1.
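For a sense of scale, here is a minimal sketch with the physical constants restored, $E \ge k_{B} T \ln 2$, evaluated at an illustrative room temperature of 300 K.

```python
import math

# Landauer limit with constants restored: E >= k_B * T * ln 2.
k_B = 1.380649e-23      # Boltzmann constant, J/K
T = 300.0               # illustrative room temperature, K
E_min = k_B * T * math.log(2)
print(f"Minimum dissipation to erase one bit at {T:.0f} K: {E_min:.2e} J")   # ~2.9e-21 J
```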

Physical limit of data storage: 1 BekensteinBytes

Using both the Bekenstein bound and the Landauer principle, one can compute a physical limit for data storage on a unit sphere,

$$S \sim T$$

If we scale this with the inverse of the Planck area $\ell_{p}^{2}$, taking one bit of information per Planck area, and use the maximum attainable temperature of $10^{32}$ K, the final value corresponds to $\approx 10^{100}$ bits. This in turn corresponds to roughly $10^{99}$ bytes, on the order of Giga-Quetta-Quetta-Quetta bytes (1 Quetta byte is $10^{30}$ bytes).
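As a quick check on the bookkeeping, the following sketch converts the rough figure of $10^{100}$ bits into Quetta-scaled bytes; the only assumptions are 8 bits per byte and 1 Quetta byte $= 10^{30}$ bytes.

```python
# Convert the rough bound of ~1e100 bits into Quetta-scaled bytes.
bits = 1e100                 # rough figure from the argument above
n_bytes = bits / 8           # 8 bits per byte
quetta = 1e30                # 1 Quetta byte, in bytes

print(f"{n_bytes:.2e} bytes = {n_bytes / quetta**3:.2e} Quetta·Quetta·Quetta bytes")
# -> on the order of 1e9, i.e. roughly Giga-Quetta-Quetta-Quetta bytes.
```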

At this point in time, we can propose calling this $10^{100}$ figure 1 BekensteinBytes. However, with more fine-grained computations the number may change. Our purpose here is to give a very rough idea of the scale of the ultimate physical limit of data storage, in BekensteinBytes.

Bekenstein Information Conjecture: One cannot compress more than 1 BekensteinBytes into the smallest patch of space.

Outlook

An interesting connection between quantum gravity and computation provides a physical limit on how much information can be stored in a given unit of space. We call this the Bekenstein Information Conjecture.

Further reading

Cite as 

 @misc{suezen25datalimit, 
     title = {Ultimate physical limit of data storage: Connecting Bekenstein Bound  to Landauer Principle}, 
     howpublished = {\url{https://science-memo.blogspot.com/2025/04/ultimate-physical-limit-of-data-storage.html}}, 
     author = {Mehmet Süzen},
     year = {2025}
}  


Friday, 26 April 2024

Basic understanding of a metric tensor:
Disassemble the concept of a distance over Riemannian Manifolds

Preamble 

Figure: Gaussian Curvature (Wikipedia)
One of the core concepts in Physics is the so-called metric tensor. This object can encode any kind of geometry. The combined genius of Gauß, Riemann, and their contemporaries led to this great idea, probably one of the achievements of the human quantitative enlightenment. However, notational conventions and the lack of an obvious pedagogical introduction make the object mathematically elusive. Einstein's notation made it more accessible, but it still requires a more explicit explanation. In this post we disassemble the definition of a distance over an arbitrary geometry with an alternative notation.

Disassembling distance over a Riemannian manifold

The most general definition of a distance between two infinitesimally close points on any geometry, or to use the fancy term for it, a Riemannian manifold, is built from the following definitions. A manifold is actually a subset of whatever geometry we are concerned with, and 'Riemannian' signals the generalisation beyond the familiar Euclidean setting.

Definition 1: (Points on a Manifold) Any two points are defined in general coordinates $X$ and $Y$, as a row and a column vector respectively, in infinitesimal components: $X = [dx_{1}, dx_{2}, ..., dx_{n}]$ and $Y = [dx_{1}, dx_{2}, ..., dx_{m}]^{T}$.

The geometric relation between the two points is defined via the tensor product, $\otimes$; that is to say, we form a grid on the geometry.

Definition 2: (An infinitesimal grid) A grid on the geometry is formed by pairing up the points' components, i.e., a tensor product. This gives a matrix $P^{mn}$ (as in points), with first row $(dx_{1} dx_{1}, ..., dx_{1} dx_{n})$ and last row $(dx_{m} dx_{1}, ..., dx_{m} dx_{n})$.

Note that the grid here is used as a pedagogical tool; the points actually live on the continuous manifold. Now we want to compute the distance between grid points, $dS^{n}$, and this is where the metric tensor comes to the rescue.

Definition 3: (Metric tensor) A metric tensor $G^{nm}$ describes the geometry of the manifold and connects the infinitesimal grid to the distance, such that $dS^{n}=G^{nm} P^{mn}$.

Note that these definitions can be extended to higher dimensions, where the coordinates are no longer 1D. We omit the square root on the distance, as that is specific to the L2 distance. Here we can think of $dS^{n}$ as a distance vector, and of $G^{nm}$ and $P^{mn}$ as matrices.

Exercise: Euclidean Metric

For the most familiar geometry, Euclidean space, the metric has diagonal elements all equal to 1 and all remaining elements zero. Checking how the above definitions hold in this case is left as an exercise; a numerical sketch is given below.
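As a quick numerical check of the exercise, here is a minimal sketch of Definitions 1-3 for the 2D Euclidean case. One assumption not spelled out above: the familiar scalar $dS^{2}$ is recovered by summing the diagonal (trace) of $G^{nm} P^{mn}$; the component values are illustrative.

```python
import numpy as np

# Definitions 1-3 for the 2D Euclidean case (illustrative values).
dx = np.array([0.1, 0.2])      # infinitesimal components dx_1, dx_2

P = np.outer(dx, dx)           # Definition 2: infinitesimal grid P^{mn} (tensor product)
G = np.eye(2)                  # Definition 3: Euclidean metric tensor G^{nm} (identity)

dS = G @ P                     # the object dS^{n} = G^{nm} P^{mn}
ds_squared = np.trace(dS)      # assumed interpretation: trace gives the usual squared length

print(ds_squared, float(np.sum(dx**2)))   # both equal 0.05 = 0.1^2 + 0.2^2
```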

Conclusion 

We have discussed that the metric tensor, contrary to its name, is not a metric per se but an object that describes a geometry, with the almost magical ability to connect grids to distances.

Further reading

There are a lot of excellent books out there, but Spivak's classic work on differential geometry is recommended.

Please cite as follows:

 @misc{suezen24bumt, 
     title = {Basic understanding of a metric tensor: Disassemble the concept of a distance over Riemannian Manifolds}, 
     howpublished = {\url{https://science-memo.blogspot.com/2024/04/metric-tensor-basic.html}}, 
     author = {Mehmet Süzen},
     year = {2024}
}  

Friday, 10 November 2023

Mathematical Definition of Heuristic Causal Inference:
What differentiates DAGs and do-calculus?

Preamble 

Figure: David Hume (Wikipedia)
Experimental design is not a new concept, and randomised controlled trials (RCTs) are the solid gold standard of quantitative research when no apparent physical laws are available to validate observations. However, RCTs are often very expensive to design, and in some cases unethical or simply impossible for logistical reasons. We then fall back on causal inference's heuristic frameworks, such as potential outcomes, matching, and time-series interventions, to imagine counterfactuals and interventions. These methods provide an immensely successful toolbox for the quantitative scientist whenever the system under study has no known physical laws. DAGs and do-calculus differentiate themselves from all these approaches by moving away from full heuristics. In this post we try to state this formally, in mathematical terms, in the context of causal inference over observational data. We argue that DAGs and do-calculus bring a mathematically more principled way of practising causal inference, akin to the attitude of theoretical physics.

Definition of Heuristic Causal Inference (HeuristicCI): Observational Data 

A heuristic in general implies an approximate algorithmic solution; in causal inference, heuristics usually appear as numerical and statistical algorithms used when a full RCT is not available. This can be formalised as follows,

Definition (HeuristicCI) Given a dataset $\mathscr{D} \in \mathbb{R}^{n}$ of $n$-dimensional observations with variates $X=x_{i}$, each variate having different sub-sets (categories within $x_{i}$) with at least one category of observations. We want to test the causal connection between two distinct subsets of $X$, $\mathscr{S}_{1}, \mathscr{S}_{2}$, given interventional versions or imagined counterfactuals $\mathscr{S}_{1}^{int}, \mathscr{S}_{2}^{int}$, where at least one of these is available. Using an algorithm $\mathscr{A}$ that processes the dataset, we test an effect size $\delta$ with a statistic $\beta$, as follows: $$ \delta= \beta(\mathscr{S}_{1} , \mathscr{S}_{1}^{int})-\beta(\mathscr{S}_{2} , \mathscr{S}_{2}^{int})$$ The statistic $\beta$ can be the result of a machine learning procedure as well, and the difference $\delta$ is only one particular choice, e.g., the Average Treatment Effect (ATE). The algorithm $\mathscr{A}$ is called HeuristicCI.
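To make the definition concrete, the following is a minimal sketch of a HeuristicCI algorithm $\mathscr{A}$ under the simplest possible choices: the statistic $\beta$ is a difference in means between a subset and its interventional version, and the effect size $\delta$ is the stated difference of differences. The data-generating process, shift sizes, and sample sizes are hypothetical, purely for illustration.

```python
import numpy as np

# Minimal sketch of a HeuristicCI algorithm A (illustrative choices throughout):
# beta is a difference in means, delta an ATE-like difference of differences.
rng = np.random.default_rng(42)

def beta(observed: np.ndarray, intervened: np.ndarray) -> float:
    """Statistic beta: mean shift between an observed subset and its
    interventional (or imagined counterfactual) version."""
    return float(intervened.mean() - observed.mean())

# Hypothetical observational subsets S1, S2 and their interventional versions
s1, s1_int = rng.normal(0.0, 1.0, 500), rng.normal(0.8, 1.0, 500)   # S1, S1^int
s2, s2_int = rng.normal(0.0, 1.0, 500), rng.normal(0.1, 1.0, 500)   # S2, S2^int

# The effect size delta = beta(S1, S1^int) - beta(S2, S2^int)
delta = beta(s1, s1_int) - beta(s2, s2_int)
print(f"estimated effect size delta ~ {delta:.3f}")   # close to 0.8 - 0.1 = 0.7
```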

Many of the non-DAG, non-do-calculus methods fall directly into this category, such as potential outcomes, uplift modelling, matching, and synthetic controls. This definition may be quite obvious to practitioners with a good handle on mathematical definitions. Moreover, HeuristicCI implies a solely data-driven approach to causality, in line with Hume's purely empirical viewpoint.

The primary distinction in practising DAGs is that they bring causal ordering naturally [suezen23pco], with the scientist's cognitive process encoded, whereas HeuristicCI searches for a statistical effect size with a causal component in a fully data-driven way. A HybridCI, however, would entail using DAGs and do-calculus in connection with data-driven approaches.

Conclusion

In this short exposition we introduced the HeuristicCI concept: the category of methods that do not use DAGs and do-calculus explicitly in causal inference practice. We do not, however, put well-designed RCTs in this category, because as a gold-standard approach a properly encoded experimental design generates full interventional data reflecting the scientist's domain knowledge.

References and Further reading

Please cite as follows:

 @misc{suezen23hci, 
     title = {Mathematical Definition of Heuristic Causal Inference: What differentiates DAGs and do-calculus?}, 
     howpublished = {\url{https://science-memo.blogspot.com/2023/11/heuristic-causal-inference.html}}, 
     author = {Mehmet Süzen},
     year = {2023}
}  

Postscript A: Why is Pearlian Causal Inference very significant progress for empirical science? 

Judea Pearl's framework for causality is sometimes referred to as the “mathematisation of causality”. However, “axiomatic foundations of causal inference” is a fairer identification; Pearl's contribution to the field is on par with Kolmogorov's axiomatic foundations of probability. The key papers of this axiomatic foundation were published in 1993 (back-doors) [1] and 1995 (do-calculus) [2].


Original works of Axiomatic foundation for causal inference:

[1] Pearl, J., “Graphical models, causality, and intervention,” Statistical Science, Vol. 8, pp. 266–269, 1993. 

[2] Pearl, J., “Causal diagrams for empirical research,” Biometrika, Vol. 82, Num. 4, pp. 669–710, 1995. 

(c) Copyright 2008-2024 Mehmet Suzen (suzen at acm dot org)

This work is licensed under a Creative Commons Attribution 4.0 International License.