
Tuesday, 22 April 2025

Ultimate physical limit of data storage:
Connecting Bekenstein Bound to Landauer Principle

Preamble 
Bekenstein's Information
(Wikipedia)

Black holes are no longer so esoteric after LIGO's success and the successful imaging efforts. Their entropy behaves very differently from the entropy of ordinary matter. This led to the remarkable discovery of the so-called holographic principle: the principle stating that we live on a projection of a higher-dimensional manifestation of the universe. On the other hand, Landauer quietly made a discovery on the energy expenditure of erasing information, i.e., irreversible information processing. A similar bound put forward by Bekenstein for black holes makes a Landauer-Bekenstein bound for quantum gravity a natural avenue to study.

What is the Bekenstein Bound?

Basically, this limits the size of a black hole via a bound on its entropy. Essentially it states that the entropy $S$ of a black hole is bounded by its radius $R$ and energy $E$, with all other constants set to 1,

$$S \le R \, E$$

What is Landauer Principle?

A limit stating that any process deleting 1 bit of information has to dissipate an amount of energy proportional to its temperature $T$,

$$E \ge T \ln 2$$

Again, we set the constants to 1.

Physical limit of data storage: 1 BekensteinBytes

Using both the Bekenstein bound and the Landauer principle, one can compute a physical limit for data storage on a unit sphere,

$$S \sim T$$

If we scale this with the inverse of the Planck area $\ell_{p}^{2}$, so that one bit of information corresponds to one Planck area, and with the maximum attainable temperature of $10^{32}$ K, the final value corresponds to $\approx 10^{100}$ bits. This in turn corresponds to about 10 giga-quetta-quetta-quetta bytes (1 quettabyte is $10^{30}$ bytes).
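The order-of-magnitude arithmetic above can be sketched numerically. This is our own back-of-the-envelope reproduction of the post's rough scaling, using the rounded Planck length and the maximum temperature quoted in the text; it is not a fine-grained computation.

```python
import math

# Planck length in metres (rounded standard value)
planck_length = 1.616e-35
# One bit per Planck area: bits per square metre
bits_per_m2 = 1.0 / planck_length**2

# Maximum attainable (Planck-scale) temperature quoted in the text, in kelvin
t_max = 1e32

# Rough scaling S ~ T per unit Planck area, as in the post:
# total bits on a unit sphere, to order of magnitude
bits = bits_per_m2 * t_max

# Lands within a couple of orders of magnitude of the post's 10^100
print(f"~10^{math.log10(bits):.0f} bits")
```

As the post notes, more fine-grained computations would shift the exponent somewhat; the sketch only confirms the rough scale.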

At this point in time, we propose to call $10^{100}$ bits 1 BekensteinBytes. However, with more fine-grained computations the number may change. Our purpose here is to give a very rough idea about the scale of the ultimate physical limits of data storage in BekensteinBytes.

Bekenstein Information Conjecture : One cannot compress more than 1 BekensteinBytes on a smallest patch of space.

Outlook

An interesting connection between quantum gravity and computation provides a physical limit on how much information can be stored in a given unit of space. We called this the Bekenstein Information Conjecture.

Further reading

Cite as 

 @misc{suezen25datalimit, 
     title = {Ultimate physical limit of data storage: Connecting Bekenstein Bound  to Landauer Principle}, 
     howpublished = {\url{https://science-memo.blogspot.com/2025/04/ultimate-physical-limit-of-data-storage.html}}, 
     author = {Mehmet Süzen},
     year = {2025}
}  


Friday, 26 April 2024

Basic understanding of a metric tensor:
Disassemble the concept of a distance over Riemannian Manifolds

Preamble 

Gaussian Curvature (Wikipedia)
One of the core concepts in physics is the so-called metric tensor. This object encodes any kind of geometry. The combined genius of Gauß, Riemann and their contemporaries led to this great idea, probably one of the achievements of the human quantitative enlightenment. However, notational conventions and the lack of an obvious pedagogical introduction make the object mathematically elusive. Einstein's notation made it more accessible, but it still requires more explicit explanation. In this post we disassemble the definition of a distance over any geometry with an alternate notation.

Disassemble distance over a Riemannian Manifolds

The most general definition of a distance between two infinitesimally close points on any geometry, or to use the fancy word for it, a Riemannian manifold, is built from the following definitions. A manifold is a subset of whatever geometry we are concerned with, and Riemannian implies a generalisation.

Definition 1: (Points on a manifold) Any two points are defined in general coordinates $X$ and $Y$. They are defined as row and column vectors respectively, in infinitesimal components: $X = [dx_{1}, dx_{2}, ..., dx_{n}]$ and $Y = [dx_{1}, dx_{2}, ..., dx_{m}]^{T}$.

The geometric relation between the two points is defined via the tensor product, $\otimes$; that is to say, we form a grid on the geometry.

Definition 2: (An infinitesimal grid) A grid on a geometry is formed by pairing up each point's components, i.e., a tensor product. This is a matrix $P^{mn}$ (as in points), with first row $(dx_{1} dx_{1}, ..., dx_{1} dx_{n})$ and last row $(dx_{m} dx_{1}, ..., dx_{m} dx_{n})$.

Note that the grid here is used as a pedagogical tool; the points actually live on the continuous manifold. Now we want to compute the distance between grid points, $dS^{n}$, and here the metric tensor comes to the rescue.

Definition 3: (Metric tensor) A metric tensor $G^{nm}$ describes a geometry of the manifold that connects the infinitesimal grid to distance, such that $dS^{n}=G^{nm} P^{mn}$.

Note that these definitions can be extended to higher dimensions, where the coordinates are no longer 1D. We omit the square root on the distance, as that's also specific to the L2 distance. Here, we can think of $dS^{n}$ as a distance vector, and of $G^{nm}$ and $P^{mn}$ as matrices.

Exercise: Euclidean Metric

In a familiar geometry, the metric for Euclidean space reads: diagonal elements all 1 and the rest of the elements zero. How the above definitions hold is left as an exercise.
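As a numerical sketch of the exercise (our own illustration with hypothetical displacement values, not worked out in the post), the contraction of the metric with the infinitesimal grid can be checked for the Euclidean case, where the squared distance reduces to the familiar sum of squares:

```python
import numpy as np

# Infinitesimal displacement components (hypothetical values)
dx = np.array([0.1, 0.2, 0.3])

# Definition 2: infinitesimal "grid" as a tensor (outer) product, P^{mn}
P = np.outer(dx, dx)

# Euclidean metric: identity matrix (diagonal 1s, zeros elsewhere)
G = np.eye(3)

# Squared distance: full contraction of the metric with the grid
ds2 = np.sum(G * P)

# For the Euclidean metric this is just the usual sum of squares
assert np.isclose(ds2, np.sum(dx**2))
print(ds2)  # 0.14
```

Replacing `G` with any other symmetric positive-definite matrix gives the squared distance in the corresponding geometry, which is exactly the role Definition 3 assigns to the metric tensor.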

Conclusion 

We have discussed that a metric tensor, contrary to its name, isn't a metric per se but an object that describes a geometry, with the almost magical ability to connect grids to distances.

Further reading

There are a lot of excellent books out there, but Spivak's classic differential geometry text is recommended.

Please cite as follows:

 @misc{suezen24bumt, 
     title = {Basic understanding of a metric tensor: Disassemble the concept of a distance over Riemannian Manifolds}, 
    howpublished = {\url{https://science-memo.blogspot.com/2024/04/metric-tensor-basic.html}},
     author = {Mehmet Süzen},
     year = {2024}
}  

Friday, 10 November 2023

Mathematical Definition of Heuristic Causal Inference:
What differentiates DAGs and do-calculus?

Preamble 

David Hume (Wikipedia)
Experimental design is not a new concept, and randomised controlled trials (RCTs) are the solid gold standard of quantitative research when no apparent physical laws are available to validate observations. However, RCTs are very expensive to design, and in some cases not ethical or not possible for logistical reasons. Then we fall back on causal inference's heuristic frameworks, such as potential outcomes, matching, and time-series interventions, for imagining counterfactuals and interventions. These methods provide an immensely successful toolbox for quantitative scientists working on systems without known physical laws. DAGs and do-calculus differentiate themselves from all these approaches by trying to move away from full heuristics. In this post we try to postulate this formally, in mathematical terms, in the context of causal inference over observational data. We establish that DAGs and do-calculus bring a mathematically more principled way of practicing causal inference, akin to the attitude of theoretical physics.

Definition of Heuristic Causal Inference (HeuristicCI) : Observational Data 

Heuristics in general imply algorithmic approximate solutions, which usually appear as numerical and statistical algorithms in causal inference when a full RCT is not available. This can be formalised as follows,

Definition (HeuristicCI) Given an $n$-dimensional dataset of observations $\mathscr{D} \in \mathbb{R}^{n}$, having variates $X=x_{i}$, each with different subsets (categories within $x_{i}$) containing at least one category of observations. We want to test the causal connection between two distinct subsets of $X$, $\mathscr{S}_{1}, \mathscr{S}_{2}$, given interventional versions or imagined counterfactuals, where at least one such subset is available, $\mathscr{S}_{1}^{int}, \mathscr{S}_{2}^{int}$. Using an algorithm $\mathscr{A}$ that processes the dataset, we test an effect size $\delta$ using a statistic $\beta$, as follows, $$ \delta= \beta(\mathscr{S}_{1} , \mathscr{S}_{1}^{int})-\beta(\mathscr{S}_{2} , \mathscr{S}_{2}^{int})$$ The statistic $\beta$ can be the result of a machine learning procedure as well, and the difference $\delta$ is only one particular choice, e.g., the Average Treatment Effect (ATE). The algorithm $\mathscr{A}$ is called HeuristicCI.
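As a minimal sketch of the effect-size computation in the definition (our own illustration on synthetic data, collapsing the pairwise statistic into a simple ATE-style difference of group means, with $\beta$ taken as the mean):

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical observational data: a control group and a group under an
# (imagined) intervention that shifts the outcome by a true effect of 2.0
control = rng.normal(loc=10.0, scale=1.0, size=10_000)
treated = rng.normal(loc=12.0, scale=1.0, size=10_000)

# beta as a simple statistic: the group mean
def beta(group):
    return np.mean(group)

# delta as the difference of the statistic across groups (ATE-style)
delta = beta(treated) - beta(control)
print(round(delta, 2))  # close to the true effect of 2.0
```

Any such algorithm, whether the statistic is a mean, a regression coefficient, or a machine learning prediction, falls under HeuristicCI in the sense of the definition: it is fully data-driven and encodes no DAG or do-calculus structure.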

Many of the non-DAG, non-do-calculus methods fall directly into this category, such as potential outcomes, uplift modelling, matching and synthetic controls. This definition may be quite obvious to practitioners with a good handle on mathematical definitions. Moreover, HeuristicCI implies a purely data-driven approach to causality, in line with Hume's purely empirical viewpoint.

The primary distinction in practicing DAGs is that they bring causal ordering naturally [suezen23pco], with the scientist's cognitive process encoded, whereas HeuristicCI searches for a statistical effect size with a causal component in a fully data-driven way. A HybridCI, however, would entail using DAGs and do-calculus in connection with data-driven approaches.

Conclusion

In this short exposition, we introduced the HeuristicCI concept: the category of methods that do not use DAGs and do-calculus explicitly in causal inference practice. However, we do not put well-designed RCTs in this category, because, as the gold standard approach, a properly encoded experimental design generates full interventional data reflecting the scientist's domain knowledge.

References and Further reading

Please cite as follows:

 @misc{suezen23hci, 
    title = {Mathematical Definition of Heuristic Causal Inference: What differentiates DAGs and do-calculus?}, 
    howpublished = {\url{https://science-memo.blogspot.com/2023/11/heuristic-causal-inference.html}},
     author = {Mehmet Süzen},
     year = {2023}
}  

Postscript A: Why is Pearlian Causal Inference very significant progress for empirical science? 

Judea Pearl's framework for causality is sometimes referred to as the "mathematisation of causality". However, "axiomatic foundations of causal inference" is a fairer identification; Pearl's contribution to the field is on par with Kolmogorov's axiomatic foundations of probability. The key papers of these axiomatic foundations were published in 1993 (back-doors) [1] and 1995 (do-calculus) [2].


Original works of Axiomatic foundation for causal inference:

[1] Pearl, J., “Graphical models, causality, and intervention,” Statistical Science, Vol. 8, pp. 266–269, 1993. 

[2] Pearl, J., “Causal diagrams for empirical research,” Biometrika, Vol. 82, Num. 4, pp. 669–710, 1995. 

Saturday, 1 April 2023

Resolution of misconception of overfitting: Differentiating learning curves from Occam curves

Preamble 

Occam (Wikipedia)
A misconception that an overfitted model can be identified by the size of the generalisation gap between the model's training and test sets over its learning curves is still out there. Even in some prominent online lectures and blog posts, this misconception is repeated without a critical look. Unfortunately, this practice diffuses into academic papers and industry, with practitioners attributing poor generalisation to overfitting. We provide a resolution of this via a new conceptual identification of complexity plots, so-called Occam curves, differentiating them from learning curves. The accessible mathematical definitions here will clarify the resolution of the confusion.

Learning Curve Setting: Generalisation Gap 

Learning curves explain how a given algorithm's generalisation improves over time or experience, a concept originating in Ebbinghaus's work on human memory. We use inductive bias to express a model, as a model can manifest itself in different forms, from differential equations to deep learning.

Definition: Given inductive bias $\mathscr{M}$ formed by $n$ datasets with monotonically increasing sizes $\mathbb{T} = \{|\mathbb{T}_{0}| < |\mathbb{T}_{1}| < ... < |\mathbb{T}_{n}| \}$. A learning curve $\mathscr{L}$ for $\mathscr{M}$ is expressed by the performance measure of the model over the datasets, $\mathbb{p} = \{ p_{0}, p_{1}, ..., p_{n} \}$; hence $\mathscr{L}$ is a curve on the plane $(\mathbb{T}, p)$.

By this definition, we deduce that $\mathscr{M}$ learns if $\mathscr{L}$ increases monotonically. 

A generalisation gap is defined as follows. 

Definition: The generalisation gap for inductive bias $\mathscr{M}$ is the difference between its learning curve on unseen datasets, $\mathscr{L}$, i.e., so-called test data, and its learning curve on the training data, $\mathscr{L}^{train}$. The difference can be a simple difference, or any measure characterising the gap.

We conjecture the following. 

Conjecture: The generalisation gap can't identify whether $\mathscr{M}$ is an overfitted model. Overfitting is about Occam's razor, and requires a pairwise comparison between two inductive biases of different complexities.

As the conjecture suggests, the generalisation gap is not about overfitting, despite the common misconception. Then why the misconception? It lies in the confusion over which curve we should produce in order to judge overfitting.

Occam Curves: Overfitting Gap [Occam's Gap] 

In generating Occam curves, a complexity measure $\mathscr{C}$ over different inductive biases $\mathscr{M_{i}}$ plays a role. The definition then reads.

Definition: Given $m$ inductive biases $\mathscr{M_{i}}$, each formed by $n$ datasets with monotonically increasing sizes $\mathbb{T} = \{|\mathbb{T}_{0}| < |\mathbb{T}_{1}| < ... < |\mathbb{T}_{n}| \}$. An Occam curve $\mathscr{O}$ for a given $\mathscr{M_{i}}$ is expressed by the performance measure of the model over complexity-dataset-size points $\mathbb{F} = [(|\mathbb{T}_{0}|, \mathscr{C}), (|\mathbb{T}_{1}|, \mathscr{C}), ..., (|\mathbb{T}_{n}|, \mathscr{C})]$; the performance of a given inductive bias reads $\mathbb{p} = \{ p_{0}, p_{1}, ..., p_{n} \}$; hence the Occam curve $\mathscr{O}$ is a curve on the plane $(\mathbb{F}, p)$.
 
Given this definition, producing Occam curves is more complicated than simply plotting test and train curves over batches. The ordering in $\mathbb{F}$ forms the so-called goodness of rank.
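A minimal numerical sketch of the distinction (our own illustration, using polynomial degree as a hypothetical complexity measure $\mathscr{C}$): judging overfitting requires a pairwise comparison of two inductive biases at the same dataset size, rather than the train-test gap of a single model.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data: a noisy sine with a small training set
x_train = np.sort(rng.uniform(0, 3, 20))
y_train = np.sin(x_train) + rng.normal(0, 0.2, 20)
x_test = np.sort(rng.uniform(0, 3, 200))
y_test = np.sin(x_test) + rng.normal(0, 0.2, 200)

def mse(coeffs, x, y):
    # Mean squared error of a fitted polynomial on (x, y)
    return np.mean((np.polyval(coeffs, x) - y) ** 2)

# Two inductive biases of different complexity (degree plays the role of C)
results = {}
for degree in (1, 9):
    coeffs = np.polyfit(x_train, y_train, degree)
    results[degree] = (mse(coeffs, x_train, y_train),
                       mse(coeffs, x_test, y_test))

# Occam (pairwise) view: the more complex bias fits the training data
# at least as well; whether it overfits is decided by comparing the two
# biases' test performances, not by either model's own train-test gap.
assert results[9][0] <= results[1][0]
for degree, (train_err, test_err) in results.items():
    print(degree, round(train_err, 3), round(test_err, 3))
```

Plotting the errors against degree at a fixed dataset size is a point on an Occam curve; plotting one model's errors against dataset size is a learning curve. The conjecture above says only the former can diagnose overfitting.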

Summary and take home

The resolution of the misconception of overfitting lies in producing Occam curves to judge the bias-variance tradeoff, not the learning curves of a single model.

Further reading & notes

  • Further posts and a glossary : The concept of overgeneralisation and goodness of rank.
  • The double descent phenomenon; it uses Occam curves, not learning curves.
  • We use dataset size as an interpretation of increasing experience, there could be other ways of expressing a gained experience, but we take the most obvious evidence.
Please cite as follows:

 @misc{suezen23rmo, 
     title = {Resolution of misconception of overfitting: Differentiating learning curves from Occam curves}, 
     author = {Mehmet Süzen},
     year = {2023}
}  

Postscript notes

Take home messages

Understanding Generalisation Gap and Occam’s gap

Model selection and evaluation are often confused by novices as well as experienced data scientists and modelling professionals. There are a lot of misconceptions in the literature, but in practice the primary take-home messages can be summarised as follows:

1. What is a model? A model is an “inductive bias” of the modeller, a selected parametrised family of functions, for example a neural network architecture choice. Contrary to common usage, a specific parametrisation of a model (a deep learning architecture) is not a different model.
2. A model’s test and training performance difference is the generalisation gap. Overfitting and under-fitting are not about the generalisation gap.
3. Overfitting or under-fitting is a comparison problem: how does a model deviate from a reference model? This is called Occam’s gap, or model selection error.
4. Occam’s gap generalises empirical risk minimisation over a learning curve. Empirical risk minimisation itself is not about learning.

How and when a model generalises well, and the generalisation of empirical risk minimisation, are currently open research topics.

(c) Copyright 2008-2024 Mehmet Suzen (suzen at acm dot org)

Creative Commons License
This work is licensed under a Creative Commons Attribution 4.0 International License.