Scientific Memo - Scientific Scratch Pad of Memo: <br>
Physics, Mathematics, Computer Science, Statistics, Chemistry <br>
<br> by <a href="https://member.acm.org/~suzen">Mehmet Süzen</a> <br>
See also: <a href="http://memosisland.blogspot.de/"> Memo's Island Blog</a> <br><br><b>Mathematical Definition of Heuristic Causal Inference: What differentiates DAGs and do-calculus?</b> (2023-11-10)<p><b>Preamble </b></p><table cellpadding="0" cellspacing="0" class="tr-caption-container" style="float: right;"><tbody><tr><td style="text-align: center;"><a href="https://upload.wikimedia.org/wikipedia/commons/thumb/f/f1/DavidHumeStatueEdinburgh.jpg/1024px-DavidHumeStatueEdinburgh.jpg" style="clear: right; margin-bottom: 1em; margin-left: auto; margin-right: auto;"><img alt="David Hume" border="0" data-original-height="800" data-original-width="556" height="320" src="https://upload.wikimedia.org/wikipedia/commons/thumb/f/f1/DavidHumeStatueEdinburgh.jpg/1024px-DavidHumeStatueEdinburgh.jpg" title="David Hume" width="222" /></a></td></tr><tr><td class="tr-caption" style="text-align: center;">David Hume (Wikipedia)</td></tr></tbody></table><i>Experimental design</i> is not a new concept, and <i>randomised controlled trials (RCTs)</i> are our solid gold standard for quantitative research when no apparent physical laws are available to validate observations. However, RCTs are often very expensive to design, unethical, or not possible for logistical reasons. In such cases we fall back on Causal Inference's heuristic frameworks, such as <i>potential outcomes</i>, <i>matching</i>, and <i>time-series interventions</i>, to imagine <i>counterfactuals and interventions</i>. These methods provide an immensely successful toolbox for quantitative scientists working on systems without known physical laws. <i>DAGs and do-calculus</i> differ from all these approaches in that they move away from full heuristics. In this post we postulate this difference formally, in mathematical terms, in the context of causal inference over observational data. We argue that <i>DAGs and do-calculus</i> bring a mathematically more principled way of practicing causal inference, akin to the attitude of <i>theoretical physics</i>. <div><p><b>Definition of Heuristic Causal Inference (<i>HeuristicCI</i>): Observational Data </b></p><p>A heuristic generally implies an approximate algorithmic solution; in causal inference heuristics usually appear as numerical and statistical algorithms applied when a full RCT is not available. This can be formalised as follows. </p><p><b style="font-style: italic;">Definition (HeuristicCI) </b>Given an $n$-dimensional dataset of observations $\mathscr{D} \in \mathbb{R}^{n}$ with variates $X=\{x_{i}\}$, each partitioned into sub-sets (categories within $x_{i}$) containing at least one category of observations. We want to test a <i>causal connection</i> between two distinct subsets of $X$, $\mathscr{S}_{1} , \mathscr{S}_{2}$, given interventional versions or imagined counterfactuals, $\mathscr{S}_{1}^{int} , \mathscr{S}_{2}^{int}$, of which at least one is available.
Using an algorithm $\mathscr{A}$ that processes the dataset, we test an <i>effect size $\delta$</i> via a statistic $\beta$, as follows, $$ \delta= \beta(\mathscr{S}_{1} , \mathscr{S}_{1}^{int})-\beta(\mathscr{S}_{2} , \mathscr{S}_{2}^{int})$$ The statistic $\beta$ can also be the result of a machine learning procedure, and taking a difference for $\delta$ is only one particular choice, e.g., the Average Treatment Effect (ATE). The algorithm $\mathscr{A}$ is called a <i>HeuristicCI</i>.</p><p>Many of the non-DAG and non-do-calculus methods fall directly into this category, such as <i>potential outcomes, uplift</i>, <i>matching</i> and <i>synthetic controls</i>. This definition should be quite obvious to practitioners with a good handle on mathematical definitions. Moreover, <i>HeuristicCI</i> implies a purely <i>data-driven approach</i> to causality, in line with Hume's empiricist viewpoint. </p><p>The primary distinction in practicing DAGs is that they bring causal ordering naturally [suezen23pco], with the scientist's cognitive process encoded, whereas a <i>HeuristicCI</i> searches for a statistical effect size with a causal component in a fully data-driven way. A <i>HybridCI</i>, however, would entail using DAGs and do-calculus in connection with data-driven approaches.</p><p><b>Conclusion</b></p><p>In this short exposition, we introduced the <i>HeuristicCI</i> concept: the category of methods that do not use DAGs and do-calculus explicitly in causal inference practice. We do not, however, put well-designed RCTs in this category, because as a gold-standard approach a <u>properly encoded</u> experimental design generates full interventional data reflecting the scientist's domain knowledge.
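</p><p>As a small illustration of the definition above, here is a minimal Python sketch of a <i>HeuristicCI</i> algorithm $\mathscr{A}$, with the simplest choice of $\beta$ as a sample mean so that $\delta$ reduces to a difference-in-means ATE; the synthetic data and variable names are illustrative assumptions.</p><pre>
# Minimal HeuristicCI sketch: difference-in-means estimate of the
# Average Treatment Effect (ATE) on synthetic data.
import numpy as np

rng = np.random.default_rng(42)

# S1^int: outcomes under an intervention (treated); S2: controls.
s1_int = rng.normal(loc=1.2, scale=1.0, size=500)
s2 = rng.normal(loc=1.0, scale=1.0, size=500)

beta = np.mean   # beta could equally be a machine learning estimator

delta = beta(s1_int) - beta(s2)   # one particular choice: ATE
print(f"Estimated ATE (delta): {delta:.3f}")
</pre>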
<p><b>References and Further reading</b></p><ul style="text-align: left;"><li>Looper repo : <a href="https://github.com/msuzen/looper">A resource list for causality in statistics, data science and physics</a></li><li>[suezen23pco] <a href="http://memosisland.blogspot.com/2023/09/causal-ordering-dags-.html">Practical Causal Ordering: Why weighted DAGs are powerful for causal inference?</a></li><li>Related Wikipedia articles</li><ul><li><a href="https://en.wikipedia.org/wiki/Propensity_score_matching">Propensity score matching</a></li><li><a href="https://en.wikipedia.org/wiki/Synthetic_control_method">Synthetic control method</a></li><li><a href="https://en.wikipedia.org/wiki/Difference_in_differences">Difference in Differences</a></li><li><a href="https://en.wikipedia.org/wiki/Uplift_modelling">Uplift Modelling</a></li></ul></ul><p>Please cite as follows:</p><pre>
@misc{suezen23hci,
  title = {Mathematical Definition of Heuristic Causal Inference: What differentiates DAGs and do-calculus?},
  howpublished = {\url{https://science-memo.blogspot.com/2023/11/heuristic-causal-inference.html}},
  author = {Mehmet Süzen},
  year = {2023}
}
</pre><p><b>Postscript A: Why is Pearlian Causal Inference very significant progress for empirical science?</b></p><p>Judea Pearl's framework for causality is sometimes referred to as the “mathematisation of causality”. However, “axiomatic foundations of causal inference” is a fairer identification: Pearl's contribution to the field is on par with Kolmogorov's axiomatic foundations of probability. The key papers of this axiomatic foundation were published in 1993 (back-doors) [1] and 1995 (do-calculus) [2].</p><p>Original works of the axiomatic foundation for causal inference:</p>
<p>[1] Pearl, J., “Graphical models, causality, and intervention,” <i>Statistical Science</i>, Vol. 8, pp. 266–269, 1993.</p>
<p>[2] Pearl, J., “Causal diagrams for empirical research,” <i>Biometrika</i>, Vol. 82, Num. 4, pp. 669–710, 1995.</p><br><br><b>Resolution of misconception of overfitting: Differentiating learning curves from Occam curves</b> (2023-04-01)<p><b>Preamble</b> </p><table cellpadding="0" cellspacing="0" class="tr-caption-container" style="float: right; margin-left: 1em; text-align: right;"><tbody><tr><td style="text-align: center;"><a href="https://upload.wikimedia.org/wikipedia/commons/b/ba/GUILHERME_DE_OCCAM_(1285_-_1347)._Fil%C3%B3sofo_ingl%C3%AAs%2C_tamb%C3%A9m_conhecido_como_o_%22doutor_invenc%C3%ADvel%22_(Doctor_Invincibilis)_e_o_%22iniciador_vener%C3%A1vel%22_(Venerabilis_Inceptor)%2C.jpg" style="clear: right; margin-bottom: 1em; margin-left: auto; margin-right: auto;"><img border="0" data-original-height="800" data-original-width="566" height="320" src="https://upload.wikimedia.org/wikipedia/commons/b/ba/GUILHERME_DE_OCCAM_(1285_-_1347)._Fil%C3%B3sofo_ingl%C3%AAs%2C_tamb%C3%A9m_conhecido_como_o_%22doutor_invenc%C3%ADvel%22_(Doctor_Invincibilis)_e_o_%22iniciador_vener%C3%A1vel%22_(Venerabilis_Inceptor)%2C.jpg" width="226" /></a></td></tr><tr><td class="tr-caption" style="text-align: center;">Occam (Wikipedia)</td></tr></tbody></table>The misconception that an overfitted model can be identified by the size of the <i>generalisation gap</i> between the model's training and test learning curves is still out there. Even in some prominent online lectures and blog posts, this misconception is repeated without a critical look. The practice has unfortunately diffused into academic papers and industry, where practitioners attribute poor generalisation to overfitting. We provide a resolution via a new conceptual identification of complexity plots, so-called <i>Occam curves</i>, differentiated from learning curves. Accessible mathematical definitions here will clarify the resolution of the confusion. <p></p><p><b>Learning Curve Setting: Generalisation Gap </b></p><p>Learning curves explain how a given algorithm's generalisation improves over time or experience, originating from Ebbinghaus's work on human memory. We use inductive bias to express a model, as a model can manifest itself in different forms, from differential equations to deep learning.</p><p><u>Definition</u>: Given an inductive bias $\mathscr{M}$ formed by $n$ datasets with monotonically increasing sizes $\mathbb{T} = \{|\mathbb{T}_{0}| < |\mathbb{T}_{1}| < ... < |\mathbb{T}_{n}| \}$. A learning curve $\mathscr{L}$ for $\mathscr{M}$ is expressed by the performance measure of the model over the datasets, $\mathbb{p} = \{ p_{0}, p_{1}, ... p_{n} \}$, hence $\mathscr{L}$ is a curve on the plane of $(\mathbb{T}, p)$.
</p><p>By this definition, we deduce that $\mathscr{M}$ learns if $\mathscr{L}$ increases monotonically. </p><p>A <i>generalisation gap</i> is defined as follows. </p><p><u>Definition</u>: The generalisation gap for an inductive bias $\mathscr{M}$ is the difference between its learning curve on the data used in building it, the so-called training curve $\mathscr{L}^{train}$, and its learning curve $\mathscr{L}$ on the unseen (test) datasets. The difference can be a simple difference, or any measure differentiating the gap.</p><p>We conjecture the following. </p><p><i><u>Conjecture</u>: The generalisation gap can't identify whether $\mathscr{M}$ is an overfitted model. Overfitting is about Occam's razor, and requires a pairwise comparison between two inductive biases of different complexities.</i></p><p>As the conjecture suggests, the generalisation gap is not about overfitting, despite the common misconception. Then why the misconception? It lies in the confusion over how to produce the curve from which we could judge overfitting. </p><p><b>Occam Curves: Overfitting Gap [Occam's Gap] </b></p><div>In generating Occam curves, a complexity measure $\mathscr{C}_{i}$ over different inductive biases $\mathscr{M}_{i}$ plays a role. The definition then reads.</div><div><br /></div><div><u>Definition</u>: Given $n$ inductive biases $\mathscr{M}_{i}$ formed by $n$ datasets with monotonically increasing sizes $\mathbb{T} = \{|\mathbb{T}_{0}| < |\mathbb{T}_{1}| < ... < |\mathbb{T}_{n}| \}$. An Occam curve $\mathscr{O}$ is expressed by the performance measure of the inductive biases over complexity-dataset size functions $\mathbb{F} = \{ f_{0}(|\mathbb{T}_{0}|, \mathscr{C}_{0}), f_{1}(|\mathbb{T}_{1}| , \mathscr{C}_{1}), ..., f_{n}(|\mathbb{T}_{n}| , \mathscr{C}_{n}) \}$. The performance of each inductive bias reads $\mathbb{p} = \{ p_{0}, p_{1}, ... p_{n} \}$; hence the Occam curve $\mathscr{O}$ is a curve on the plane of $(\mathbb{F}, p)$. </div><div> </div><div>Given this definition, producing Occam curves is more complicated than simply plotting test and train curves over batches. The ordering in $\mathbb{F}$ forms what is called goodness of rank.</div><div><br /></div><div><b>Summary and take home</b></div><div><b><br /></b></div><div>The resolution of the misconception of overfitting lies in producing Occam curves to judge the bias-variance tradeoff, not the learning curves of a single model.
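<br /><br />As a minimal sketch of the contrast, the snippet below produces a learning curve for a single inductive bias and an Occam curve over a family of inductive biases; the scikit-learn polynomial-regression family, the complexity measure (polynomial degree) and the dataset sizes are illustrative assumptions.<pre>
# A sketch contrasting a learning curve (one inductive bias M, growing
# dataset size) with an Occam curve (a family of inductive biases M_i
# ordered jointly by size and complexity). All choices illustrative.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(1)
X_test = np.linspace(-1, 1, 200)[:, None]
y_test = np.sin(3 * X_test[:, 0]) + rng.normal(scale=0.1, size=200)

def performance(degree, n):
    """R^2 on a fixed test set for a degree-`degree` model on n points."""
    X = rng.uniform(-1, 1, size=(n, 1))
    y = np.sin(3 * X[:, 0]) + rng.normal(scale=0.1, size=n)
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    return model.fit(X, y).score(X_test, y_test)

sizes = [20, 40, 80, 160, 320]
learning_curve = [(n, performance(5, n)) for n in sizes]   # fixed M
occam_curve = [((n, d + 1), performance(d, n))             # f(|T_i|, C_i)
               for n, d in zip(sizes, [1, 3, 5, 9, 15])]   # varying M_i
print(learning_curve)
print(occam_curve)
</pre>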
</div><p><b>Further reading & notes</b></p><ul style="text-align: left;"><li>Further posts and a glossary : <a href="http://science-memo.blogspot.com/2022/12/overfitting-machine-learning-overgeneralisation.html">The concept of overgeneralisation and goodness of rank</a>.</li><li>The double descent phenomenon uses Occam curves, not learning curves.</li><li>We use dataset size as an interpretation of <i>increasing experience</i>; there could be other ways of expressing gained experience, but we take the most obvious evidence.</li></ul><div>Please cite as follows:</div><div><br /></div><pre>
@misc{suezen23rmo,
  title = {Resolution of misconception of overfitting: Differentiating learning curves from Occam curves},
  howpublished = {\url{https://science-memo.blogspot.com/2023/04/Occam-curves.html}},
  author = {Mehmet Süzen},
  year = {2023}
}
</pre><br><br><b>Loschmidt's Paradox and Causality: Can we establish a Pearlian expression for Boltzmann's H-theorem?</b> (2023-02-25)<table cellpadding="0" cellspacing="0" class="tr-caption-container" style="float: right;"><tbody><tr><td style="text-align: center;"><a href="https://upload.wikimedia.org/wikipedia/commons/a/ad/Boltzmann2.jpg" style="clear: right; margin-bottom: 1em; margin-left: auto; margin-right: auto;"><img border="0" data-original-height="600" data-original-width="490" height="320" src="https://upload.wikimedia.org/wikipedia/commons/a/ad/Boltzmann2.jpg" width="261" /></a></td></tr><tr><td class="tr-caption" style="text-align: center;">Boltzmann (Wikipedia)</td></tr></tbody></table><p><b>Preamble</b></p><p>Probably the most important achievement of humans is the ability to produce scientific discoveries, which help us objectively understand how nature works and build artificial tools where no other species can. Entropy is an elusive concept and one of the crown achievements of the human race. We question here whether causal inference and Loschmidt's paradox can be reconciled. </p><p><b>Mimicking analogies are not physical</b></p><p>Before even trying to understand what physical entropy is, we should make clear that there is only one kind of physical entropy, from thermodynamics, formulated by <i>Gibbs-Boltzmann ($S_{G}$ and $S_{B}$)</i>. Other entropies, such as Shannon's information entropy, are all analogies to physics, i.e., mimicking concepts.</p><p><b>Why is counting microstates associated with time?</b></p><p>The following definition of entropy is due to Boltzmann; Gibbs' formulation is technically different, but the two are actually equivalent.</p><p><i><b>Definition 1</b>: The entropy of a macroscopic material is associated with the number of different states $\Omega$ that its constituent elements can take.
This is associated with $S_{B}$, Boltzmann's entropy. </i></p><p>Now, as we know from basic thermodynamics classes, the entropy change of a system cannot decrease; hence time's arrow. </p><p><i><b>Definition 2</b>: Time's arrow is identified with the change in entropy of material systems, i.e., $\delta S \ge 0$.</i></p><p>We put aside the distinction between open and closed systems and between equilibrium and non-equilibrium dynamics, and concentrate on how counting a system's states comes to be associated with time's arrow. </p><p><b>Loschmidt's Paradox: Irreversible occupancy on discrete states and causal inference</b></p><p>The core idea can probably be explained via a discrete lattice and occupancy on it over a chain of dynamics. </p><p><i><b>Conjecture 1</b>: Occupancy of $N$ items on $M$ discrete states, $M>N$, evolving with dynamical rules $\mathscr{D}$ necessarily increases $\Omega$, compared to the number of samplings if it were $M=N$. </i></p><p>This conjecture might explain the entropy increase, but irreversibility of the dynamical rule $\mathscr{D}$ is required to address Loschmidt's Paradox, i.e., how to generate irreversible evolution given time-reversal dynamics. Actually, <i>do-calculus</i> may provide a language to resolve this, by inducing interventional notation on Boltzmann's H-theorem with a Pearlian view. The full definition of the H-function is a bit more involved, but here we summarise it in condensed form with a <i>do operator</i> version of it.</p><p><i><b>Conjecture 2 (H-Theorem do-conjecture)</b>: Boltzmann's H-function provides a basis for entropy increase; it is associated with the conditional probability of a system $\mathscr{S}$ being in state $X$ on ensemble $\mathscr{E}$, hence $P(X|\mathscr{E})$. An irreversible evolution from time-reversal dynamics should then use the interventional notation $P(X|do(\mathscr{E}))$. The information on how time-reversal dynamics leads to time's arrow is thus encoded in how the dynamics provides interventional ensembles, $do(\mathscr{E})$.</i></p><p><b>Conclusion</b></p><p>We provided some hints on why counting states would lead to time's arrow, an irreversible dynamics. In light of the development of a mathematical language for causal inference in statistics, the concepts are converging: understanding Loschmidt's Paradox via do-calculus can establish an asymmetric notation. Loschmidt's question is a long-standing problem in physics and philosophy with great practical implications across the physical sciences.</p>
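<p>As a toy numerical illustration of Conjecture 1 (assuming indistinguishable items with at most one item per state, so that $\Omega = \binom{M}{N}$; this combinatorial reading is an illustrative assumption, not part of the conjecture's statement):</p><pre>
# Occupancy configurations Omega for N items on M >= N discrete
# lattice states, versus the M = N reference case.
from math import comb

N = 10
for M in (10, 20, 40, 80):   # M = N first, then M > N
    print(f"M={M:3d}  Omega={comb(M, N)}")
# Omega grows rapidly once M > N, consistent with the conjectured
# entropy increase as dynamics spread occupancy over more states.
</pre>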
<p><b>Further reading</b></p><ul style="text-align: left;"><li><a href="https://en.wikipedia.org/wiki/Loschmidt%27s_paradox"><i>Loschmidt's Paradox</i></a></li><li><a href="https://en.wikipedia.org/wiki/H-theorem"><i>H-Theorem</i></a></li><li><i>do-Calculus</i> revisited, J. Pearl (2012) <a href="https://ftp.cs.ucla.edu/pub/stat_ser/r402.pdf">pdf</a></li><li>Causal Inference : <i><a href="https://github.com/msuzen/looper">Looper Repository for collection of resources.</a></i></li><li>H-theorem do-conjecture, M. Süzen, <a href="https://arxiv.org/abs/2310.01458">arxiv:2310.01458</a> (2023)</li></ul><div>Please cite as follows:</div><div><br /></div><pre>
@misc{suezen23lpc,
  title = {Loschmidt's Paradox and Causality: Can we establish a Pearlian expression for Boltzmann's H-theorem?},
  howpublished = {\url{https://science-memo.blogspot.com/2023/02/loschimidts-do-calculus.html}},
  author = {Mehmet Süzen},
  year = {2023}
}

@article{suzen23htd,
  title = {H-theorem do-conjecture},
  author = {Mehmet Süzen},
  preprint = {arXiv:2310.01458},
  url = {https://arxiv.org/abs/2310.01458},
  year = {2023}
}
</pre><br><b>Insights into Bekenstein entropy with intuitive mathematical definitions: A look into the thermodynamics of black holes</b> (2023-02-18)<table cellpadding="0" cellspacing="0" class="tr-caption-container" style="float: left; margin-right: 1em; text-align: left;"><tbody>
<tr><td style="text-align: center;"><a href="https://upload.wikimedia.org/wikipedia/commons/f/f1/Bekenstein100_(cropped).JPG" style="clear: left; margin-bottom: 1em; margin-left: auto; margin-right: auto;"><img border="0" data-original-height="800" data-original-width="596" height="200" src="https://upload.wikimedia.org/wikipedia/commons/f/f1/Bekenstein100_(cropped).JPG" width="148" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;">Jacob Bekenstein<br />
(Wikipedia)</td></tr>
</tbody></table>
<b>Preamble</b><br />
<br />
Thermodynamics of black holes has emerged as one of the most interesting areas of research in theoretical physics [<a href="https://www.amzn.com/dp/0226870278" target="_blank">Wald1994</a>], especially after <a href="https://en.wikipedia.org/wiki/First_observation_of_gravitational_waves">LIGO's massive success.</a> The striking results of <a href="https://en.wikipedia.org/wiki/Jacob_Bekenstein" target="_blank">Jacob Bekenstein</a> [<a href="https://journals.aps.org/prd/abstract/10.1103/PhysRevD.7.2333" target="_blank">Bekenstein1973</a>], proposing a formulation of entropy for a black hole, were one of the major turning points in building explanations for the thermodynamics of gravitational systems. <i>Bekenstein entropy</i> is a so-called phenomenological relationship, and a surprisingly easy concept to understand using basic dimensional analysis. In this post, we will show how to understand the entropy of a black hole using only basic dimensional analysis, fundamental physics constants and the basic definition of entropy. <br />
<b><br /></b>
<b>Dimensions and scales</b><br />
<br />
Dimensional analysis appears in many different areas of physics and engineering, from fluid dynamics to relativity. The starting point is to understand the concept of <i>dimensions</i>. Every <i>quantity</i> we measure in real life has a dimension. It means a quantity $\mathscr{Q}$ we obtain from a measurement $\mathscr{M}$ has a numeric value $v$ and an associated unit $u$: $\mathscr{Q}=\langle v, u \rangle$ given $\mathscr{M}$. There are 3 distinct fundamental unit types: length (L), time (T) and mass (M).<div>
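<br /><br />A quick sketch of the value-unit pair $\mathscr{Q}=\langle v, u \rangle$ in code (the representation below is an illustrative assumption, not a standard library):<pre>
# A measured quantity: a numeric value plus exponents of the three
# fundamental unit types L, T, M. Dimensions compose under products.
from dataclasses import dataclass

@dataclass
class Quantity:
    value: float
    dims: dict  # e.g. {"L": 1, "T": 0, "M": 0} for a length

    def __mul__(self, other):
        dims = {k: self.dims.get(k, 0) + other.dims.get(k, 0)
                for k in ("L", "T", "M")}
        return Quantity(self.value * other.value, dims)

area = Quantity(3.0, {"L": 1}) * Quantity(2.0, {"L": 1})
print(area)   # value=6.0 with dims L^2, T^0, M^0
</pre>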
<br /><b>Intuitive Bekenstein entropy (BE) for a black hole : Informal mathematical definition</b></div><div><b><br /></b></div><div>Black holes are astronomical objects that are not directly observable, due to their mass being condensed in a small area. The primary object we will use is the Planck length $L_{p}$: the smallest physically possible patch of space-time, associated with the states of a black hole on its horizon. We won't define the Planck length in detail here, but with knowledge of the fundamental physics constants and the dimensional analysis we mentioned, one can obtain a constant value for this length. </div><div><br /></div><div><i><u>Definition</u></i>: The finite entropy $S_{f}$ of an object is associated with the number of states $\Omega$ the system can attain.</div><div><br /></div><div>If we apply this definition to a black hole's entropy: </div><div><br /></div><div><i><u>Definition</u></i>: The finite entropy of a black hole $S_{f}^{BH}$ is associated with the number of its states $\Omega$, the number of elements on its surface area $A$. The elements are discretised into small patches $a_{p}=L_{p}^{2}$. Intuitively, then, $\Omega$ yields $A$ divided by $a_{p}$.</div><div> </div><div><b><i>Bekenstein entropy is not thermodynamic entropy alone, and a family of Bekenstein entropies</i></b></div><div><br /></div><div>Unit analysis tells us that $A$ has the dimension of length squared. We intentionally omit any equality in the above definition of $S_{f}^{BH}$ because, in practice, <i>Bekenstein entropy is not thermodynamic entropy alone</i>. The formulation usually presented as BE uses an equality for the above approach; however, this is not strictly thermodynamical alone, which is why we frame our definitions as finite entropy and only express the relationship as an association. Similarly, introducing other constants would yield different Bekenstein entropies, i.e., a family of Bekenstein entropies.</div><div><br /></div><div><b>Why does the surface area define the states of a black hole?</b></div><div><br /></div><div>This is an amazing question, and Bekenstein's main contribution is to associate the number of states of a black hole with the event horizon, i.e., the point-of-no-return layer beyond which ordinary matter can't return. The justification is that all other properties of a black hole define this surface. Here is the intuitive definition of the states of a black hole.</div><div><br /></div><div><u>Definition</u>: A surface area $\mathscr{A}$ is formed by the set of physical properties, such as charge density and angular momentum, forming an ensemble. These ensembles indirectly sample thermodynamic ensembles. </div><div><br /></div><div>Even though the intuition is there, this might still remain an open question.</div><div><b><br /></b></div><div><b>Conclusion</b></div><div><b><br /></b></div><div>We conveyed intuitively the primary idea that Bekenstein put forward in his 1973 paper. However, we identify its thermodynamic limit as an open research area: the thermodynamic limit implies taking the infinite limit of both the area and the discretised patches simultaneously, and even though it sounds as if the values might diverge, the simultaneous limit would converge to a finite value for physical matter.
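<br /><br />As a back-of-the-envelope sketch of the dimensional-analysis argument above (SI-unit constants; the solar-mass black hole is an illustrative choice, and equating $\Omega$ with the patch count is the intuitive association, not an equality):<pre>
# Planck length from fundamental constants, then Omega ~ A / L_p^2
# for a Schwarzschild black hole of one solar mass. SI units.
import math

G = 6.674e-11      # gravitational constant [m^3 kg^-1 s^-2]
hbar = 1.055e-34   # reduced Planck constant [J s]
c = 2.998e8        # speed of light [m s^-1]

L_p = math.sqrt(hbar * G / c**3)   # Planck length, ~1.6e-35 m
a_p = L_p**2                       # smallest patch of the horizon

M_sun = 1.989e30                   # solar mass [kg]
r_s = 2 * G * M_sun / c**2         # Schwarzschild radius, ~3 km
A = 4 * math.pi * r_s**2           # horizon area [m^2]

print(f"L_p = {L_p:.3e} m, Omega ~ {A / a_p:.3e}")  # ~4e77 patches
</pre>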
</div><div><br /></div><div><b>Primary Papers</b></div><div><ul style="text-align: left;"><li>Bekenstein J.D.: Lettere al Nuovo Cimento, 4, 737, (1972)</li><li><a href="https://journals.aps.org/prd/abstract/10.1103/PhysRevD.7.2333">Bekenstein J.D.: Physical Review D, 7, 2333, (1973)</a></li><li>Bekenstein J.D.: Physical Review D, 9, 3292 (1974)</li><li>Bekenstein J.D.: Physical Review D, 12, 3077 (1975)</li></ul><div><b>Primary Book</b></div></div><div><ul style="text-align: left;"><li><a href="https://www.amazon.com/gp/product/0226870278">Wald, Quantum field theory in curved space times (1994)</a></li></ul></div><div><br /></div><div>Please cite as follows:</div><div><br /></div><pre>
@misc{suezen23ibe,
  title = {Insights into Bekenstein entropy with intuitive mathematical definitions},
  howpublished = {\url{https://science-memo.blogspot.com/2023/02/bekenstein-entropy.html}},
  author = {Mehmet Süzen},
  year = {2023}
}
</pre><div><br /></div><div><b>Postscript A: Information can't be destroyed</b></div><div><br /></div>
<p>Proposals that information is destroyed into thin air are a red flag for any physical theory: this includes theories on evaporating black holes. Bekenstein's insight in this direction is that surface area is associated with entropy. The black hole's information in this context is quite different from Shannon's entropy. For an evaporating black hole, the area approaching zero is not the same as the information going to zero: the surface area is a function of the physical properties of the stellar object, which are bound by conservation laws in their interactions with the surroundings. Hence, the information is preserved even if the area goes to zero.</p><br><br><b>Misconceptions on non-temporal learning: When do machine learning models qualify as prediction systems?</b> (2023-01-28)<p><b>Preamble</b></p><table cellpadding="0" cellspacing="0" class="tr-caption-container" style="float: right; margin-left: 1em; text-align: right;"><tbody><tr><td style="text-align: center;"><a href="https://upload.wikimedia.org/wikipedia/commons/0/0b/Ybc7289-bw.jpg" style="clear: right; margin-bottom: 1em; margin-left: auto; margin-right: auto;"><img border="0" data-original-height="315" data-original-width="338" height="186" src="https://upload.wikimedia.org/wikipedia/commons/0/0b/Ybc7289-bw.jpg" width="200" /></a></td></tr><tr><td class="tr-caption" style="text-align: center;"><span> Babylonian Tablet for <br />square root of 2.<br /> (Wikipedia)</span></td></tr></tbody></table>Prediction implies a mechanics, as in knowing the form of a trajectory over time. Strictly speaking, a predictive system implies knowing a solution to the path, a set of variables depending on time: the time evolution of the system under consideration. Here we define semi-formally what a prediction system is mathematically, and show how non-temporal learning can be mapped into a prediction system. <p></p><p><b>Temporal learning : Recurrence, trajectory and sequences</b></p><p>A trajectory can be seen as a function of time, identified in a recurrence manner, i.e., $x(t_{i})=f(x(t_{i-1}))$. This is, however, only one of the possible definitions. The physical equivalent appears as a solution to an ordinary differential equation, such as the velocity $v(t) = dx(t)/dt$, with the recurrence acting on its solution. In machine learning, on the other hand, an empirical approach is taken over sequence data, such as natural language or log events occurring in sequence. Any modelling on such data is called temporal learning. This includes classical time-series algorithms, gated units in deep learning and differential equations.</p><p><u>Definition</u>: A system $\mathscr{F}$ that is built with data $D$ but utilised on data $D'$ not used in building it qualifies as a prediction system if both $D$ and $D'$ are temporal sets and the output of the system is a horizon $\mathbb{H}$, that is, a sequence. </p>
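<p>As a minimal sketch of this recurrence view (the dynamics $f$, here a constant-velocity Euler step, the step size and the horizon length are illustrative assumptions):</p><pre>
# Iterating x(t_i) = f(x(t_{i-1})) yields a horizon H, a sequence.
def f(x, v=1.5, dt=0.1):
    return x + v * dt          # one recurrence step: dx/dt = v

x = 0.0
horizon = []                   # H: the predicted sequence
for _ in range(5):
    x = f(x)
    horizon.append(round(x, 2))

print(horizon)                 # [0.15, 0.3, 0.45, 0.6, 0.75]
</pre>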
<p><b>Using non-temporal supervised learning is interpolation or extrapolation</b></p><p>A frequent practice in industry is to turn temporal interactions into a flat set of data vectors $v_{i}$, where $i$ corresponds to a time point or an arbitrary property of the dataset, thereby breaking the temporal associations and causal links. This could also manifest as a set of images with labels that have no ordering or associational property in the dataset. A system built upon such non-temporal datasets still constitutes a learning system, but one of interpolation or extrapolation. Strictly speaking, utilising such systems on $D'$ does not qualify them as prediction systems. </p><p><b>Mapping with pre-processing</b></p><p>A mapping from non-temporal data to temporal data is indeed possible, if the original form is not yet temporal. This has been studied in the complexity literature. It requires an algorithm to map the flattened data vectors we mentioned into sequence data. </p><p><b>Mapping with Causality</b></p><p>Models from causal inference are distinct: they qualify as prediction systems even if they are trained on non-temporal data, because causality establishes a temporal learning.</p><p><b>Non-temporal models: Do they still learn?</b></p><p>Even though we exclude non-temporal model utilisation from prediction systems, such models are still classified as learned models, because their outputs are generated by a learning procedure. </p><p><b>Conclusion</b></p><p>A differentiation between temporal and non-temporal learning is provided in an associational manner. This results in a definition of a prediction system that excludes non-temporal machine learning models, such as models over unlinked sets of vectors, i.e., sets of numbers mapped from any data modality. </p><p><b>Further reading & postscript notes</b></p><ul style="text-align: left;"><li><a href="http://memosisland.blogspot.com/2020/12/practice-causal-inference-conventional.html">Practice causal inference: Conventional supervised learning can't do inference</a></li><li>Causal inference : Editor's selections from <a href="https://github.com/msuzen/looper/blob/master/looper.md#editors-selection">the looper repo</a>.</li><li>Causal models are usually not trained but validated, or so-called discovered.</li></ul><br><br><b>The concept of overgeneralisation and goodness of rank : Overfitting is not about comparing training and test learning curves</b> (2022-12-20)<p><b>Preamble</b> </p><table cellpadding="0" cellspacing="0" class="tr-caption-container" style="float: right;"><tbody><tr><td style="text-align: center;"><a href="https://upload.wikimedia.org/wikipedia/commons/thumb/c/cf/Image-Disney_Concert_Hall_by_Carol_Highsmith_edit-2.jpg/1920px-Image-Disney_Concert_Hall_by_Carol_Highsmith_edit-2.jpg" style="clear: right; margin-bottom: 1em; margin-left: auto; margin-right: auto;"><img border="0" data-original-height="625" data-original-width="800" height="250" src="https://upload.wikimedia.org/wikipedia/commons/thumb/c/cf/Image-Disney_Concert_Hall_by_Carol_Highsmith_edit-2.jpg/1920px-Image-Disney_Concert_Hall_by_Carol_Highsmith_edit-2.jpg" width="320" /></a></td></tr><tr><td class="tr-caption" style="text-align: center;"><p><span style="font-size: x-small;">Walt Disney Hall,<br /></span><span style="font-size: x-small;">Los Angeles (Wikipedia)</span></p></td></tr></tbody></table><br />Unfortunately, it is still taught in machine learning classes that overfitting can be detected by comparing the training and test learning curves of a single model's performance. The origins of this <i><b>misconception</b></i> are unknown.
It looks like an <i>urban legend</i> has diffused into mainstream practice, and even in academic works the misconception is taken for granted. <i>Overfitting's</i> definition is inherently about comparing the complexities of two (or more) models. Models manifest themselves as the <i>inductive biases</i> a modeller or data scientist brings to their task. This makes overfitting, in reality, a Bayesian concept at its core. It is <i>not</i> about comparing training and test learning curves to see whether the model is following noise, but a <i>pairwise model comparison-testing procedure</i> to select the more plausible belief among our beliefs, the one with the least information: <i>entities should not be multiplied beyond necessity, i.e., Occam's razor</i>. We introduce a new concept to clarify this practically, <i>goodness of rank</i>, distinguished from the well-known <i>goodness of fit</i>, and we clarify the concepts and provide steps to attribute models as overfitted or under-fitted.<p></p><p><b>Poorly generalised model : Overgeneralisation or under-generalisation</b></p><p>The practice described in machine learning classes and followed in industry holds that overfitting is about a model following the training set closely but failing to generalise on the test set. This is not an overfitted model but a model that fails to generalise: a phenomenon that should be called <i>overgeneralisation</i> (or <i>under-generalisation</i>).
</p><p><b>A procedure to detect an overfitted model : Goodness of rank</b></p><p>We have previously provided a complexity-based abstract description of the model selection procedure, as <a href="https://science-memo.blogspot.com/2022/10/overfitting-is-about-complexity-ranking.html">complexity ranking</a>; we repeat this procedure here, identifying the overfitted model explicitly.</p><div>The following steps sketch an algorithmic recipe for complexity ranking of inductive biases, with overfitted-model identification made explicit (a code sketch follows the list):</div><ol style="text-align: left;"><li>Define a complexity measure $\mathscr{C}$($\mathscr{M}$) over an inductive bias.</li><li>Define a generalisation measure $\mathscr{G}$($\mathscr{M}$, $\mathscr{D}$) over an inductive bias and dataset.</li><li>Select a set of inductive biases, at least two, $\mathscr{M}_{1}$ and $\mathscr{M}_{2}$.</li><li>Produce complexity and generalisation measures on ($\mathscr{M}$, $\mathscr{D}$); here for two inductive biases: $\mathscr{C}_{1}$, $\mathscr{C}_{2}$, $\mathscr{G}_{1}$, $\mathscr{G}_{2}$.</li><li>Ranking of $\mathscr{M}_{1}$ and $\mathscr{M}_{2}$: $argmax \{ \mathscr{G}_{1}, \mathscr{G}_{2}\}$ and $argmin \{ \mathscr{C}_{1}, \mathscr{C}_{2}\}$ </li><li><b>$\mathscr{M}_{1}$ is an overfitted model compared to $\mathscr{M}_{2}$</b> if $\mathscr{G}_{1} \le \mathscr{G}_{2}$ and $\mathscr{C}_{1} \gt \mathscr{C}_{2}$. </li><li><b>$\mathscr{M}_{2}$ is an overfitted model compared to $\mathscr{M}_{1}$</b> if $\mathscr{G}_{2} \le \mathscr{G}_{1}$ and $\mathscr{C}_{2} \gt \mathscr{C}_{1}$.</li><li><b>$\mathscr{M}_{1}$ is an underfitted model compared to $\mathscr{M}_{2}$</b> if $\mathscr{G}_{1} \lt \mathscr{G}_{2}$ and $\mathscr{C}_{1} \lt \mathscr{C}_{2}$.</li><li><b>$\mathscr{M}_{2}$ is an underfitted model compared to $\mathscr{M}_{1}$</b> if $\mathscr{G}_{2} \lt \mathscr{G}_{1}$ and $\mathscr{C}_{2} \lt \mathscr{C}_{1}$.</li></ol><div>If two models have the same complexity, then the better-generalised model should be selected; in this case we can't conclude that either model is overfitted, only that they generalise differently. Remember that overfitting is about <i>complexity ranking: goodness of rank</i>.</div>
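<div><br /></div><div>A minimal sketch of the recipe above (assuming a scikit-learn polynomial-regression setup; the complexity measure $\mathscr{C}$ as parameter count and the generalisation measure $\mathscr{G}$ as held-out $R^2$ are illustrative choices, not prescribed here):</div><pre>
# Pairwise goodness-of-rank comparison of two inductive biases.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=(200, 1))
y = np.sin(3 * X[:, 0]) + rng.normal(scale=0.1, size=200)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

def complexity(degree):       # C: number of polynomial coefficients
    return degree + 1

def generalisation(degree):   # G: goodness of fit on held-out data
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    return model.fit(X_tr, y_tr).score(X_te, y_te)

d1, d2 = 15, 5                # two inductive biases M1, M2
C1, C2 = complexity(d1), complexity(d2)
G1, G2 = generalisation(d1), generalisation(d2)

if G1 <= G2 and C1 > C2:
    print("M1 is overfitted compared to M2")   # recipe step 6
elif G2 <= G1 and C2 > C1:
    print("M2 is overfitted compared to M1")   # recipe step 7
else:
    print("No overfitting verdict; compare generalisation alone")
</pre>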
<div><br /></div><div><b>But overgeneralisation sounds like overfitting, doesn't it?</b></div><div><br /></div><div>Operationally, overgeneralisation and overfitting imply two different things. Overgeneralisation can be detected operationally with a single model, because we can measure the generalisation performance of the model <i>alone with data</i>; in the statistical literature this is called <i>goodness of fit</i>. Moreover, overgeneralisation can also be called under-generalisation, as both imply poor generalisation performance.</div><div><br /></div><div>However, overfitting implies a model that over-performs compared to another model, i.e., the model overfits, but compared to what? Practically speaking, overgeneralisation can be detected via the holdout method, but not overfitting. Overfitting goes beyond goodness of fit to <i>goodness of rank</i>, as the recipe we provided is a pairwise model comparison.</div><div><br /></div><div><b>Conclusion</b></div><div><br /></div><div>The practice of comparing training and test learning curves for overfitting has diffused into machine learning so deeply that the concept is almost always taught in a somewhat fuzzy way, even in distinguished lectures. Older textbooks and papers correctly identify overfitting as a comparison problem. As practitioners, if we bear in mind that overfitting is about complexity ranking and requires more than one model or inductive bias to be identified, then we are in a better position to select the better model. Overfitting cannot be detected via data alone on a single model. </div><div><br /></div><div><div><b>Further reading</b></div><div><br />Some of the posts, in reverse chronological order, in which this blog has tried to convey what overfitting entails and its general implications:</div><div><ul style="text-align: left;"><li><a href="https://science-memo.blogspot.com/2022/10/overfitting-is-about-complexity-ranking.html">Overfitting is about complexity ranking of inductive biases : Algorithmic recipe</a></li><li><a href="http://science-memo.blogspot.com/2022/06/empirical-risk-minimisation-learning-curve.html">Empirical risk minimization is not learning : A mathematical definition of learning and re-understanding of overfitting and Occam's razor in machine learning</a></li><li><a href="https://science-memo.blogspot.com/2021/03/critical-look-on-why-deployed-machine.html">Critical look on why deployed machine learning model performance degrade quickly.</a></li><li><a href="http://memosisland.blogspot.com/2019/12/bringing-back-occams-razor-to-modern.html">Bringing back Occam's razor to modern connectionist machine learning.</a></li>
<a href=">
<li><a href="https://www.kdnuggets.com/2017/08/understanding-overfitting-meme-supervised-learning.html">Understanding overfitting: an inaccurate meme in Machine Learning</a></li></ul><div><b>Glossary</b> </div></div></div><div><br /></div><div>To make things clear, we provide concept definitions.</div><div><br /></div><div><b>Generalisation </b>The notion that a model can perform just as well on data it has not seen before; <i>seen</i> here is a bit vague, since the model could have seen data points close to the new data. The notion is better suited to the context of supervised learning, as opposed to compositional learning.</div><div><br /></div><div><b>Goodness of fit </b>An approach to check whether a model generalises well. </div><div><br /></div><div><b>Goodness of rank </b>An approach to check whether a model is overfitted or under-fitted compared to other models.</div><div><br /></div><div><b>Holdout method </b>A method of building a model on a portion of the available data and measuring the goodness of fit on the held-out part of the data, i.e., test and train.</div><div><b><br /></b></div><div><b>Inductive bias </b>The set of assumptions a data scientist makes in building a representation of the real world; this manifests as a model and the assumptions that come with it.</div><div><br /></div><div><b>Model</b> A model is a biased view of reality from a data scientist. It usually appears as a function of observables $X$ and parameters $\Theta$, $f(X, \Theta)$. Different values of $\Theta$ do not constitute a different model. See also <a href="https://projecteuclid.org/journals/annals-of-statistics/volume-30/issue-5/What-is-a-statistical-model/10.1214/aos/1035844977.full">What is a statistical model?, Peter McCullagh</a></div><div><br /></div><div><b>Occam's razor (Principle of parsimony) </b>The principle that the less complex explanation reflects reality better. <i>Entities should not be multiplied beyond necessity.
</i><i style="caret-color: rgb(32, 33, 34); color: #202122;"> </i></span></div><div><span style="background-color: white;"><i style="caret-color: rgb(32, 33, 34); color: #202122;"><br /></i></span></div><div><span style="background-color: white;"><span style="caret-color: rgb(32, 33, 34); color: #202122;"><b>Overgeneralisation (Under-generalisation)</b> </span><span style="caret-color: rgb(32, 33, 34); color: #202122;">If we have a good performance on the training set but very bad performance on the test set, model said to overgeneralise or under-generalise; as a result of goodness of fit testing, i.e., comparing learning curves over test and train datasets.</span></span></div><div><span style="background-color: white;"><span style="caret-color: rgb(32, 33, 34); color: #202122;"><br /></span></span></div><div><span style="background-color: white;"><span style="caret-color: rgb(32, 33, 34); color: #202122;"><b>Regularisation</b> An approach to augment model to improve generalisation.</span></span></div><div><span style="background-color: white;"><i style="caret-color: rgb(32, 33, 34); color: #202122;"><br /></i></span></div><div><b>Postscript Notes</b></div><div><span style="background-color: white; caret-color: rgba(0, 0, 0, 0.9); color: rgba(0, 0, 0, 0.9); font-family: -apple-system, system-ui, BlinkMacSystemFont, "Segoe UI", Roboto, "Helvetica Neue", "Fira Sans", Ubuntu, Oxygen, "Oxygen Sans", Cantarell, "Droid Sans", "Apple Color Emoji", "Segoe UI Emoji", "Segoe UI Emoji", "Segoe UI Symbol", "Lucida Grande", Helvetica, Arial, sans-serif; font-size: 14px;"><br /></span></div><div><i><b><span style="background-color: white; caret-color: rgba(0, 0, 0, 0.9); color: rgba(0, 0, 0, 0.9); font-family: -apple-system, system-ui, BlinkMacSystemFont, "Segoe UI", Roboto, "Helvetica Neue", "Fira Sans", Ubuntu, Oxygen, "Oxygen Sans", Cantarell, "Droid Sans", "Apple Color Emoji", "Segoe UI Emoji", "Segoe UI Emoji", "Segoe UI Symbol", "Lucida Grande", Helvetica, Arial, sans-serif; font-size: 14px;">Note: Occam’s razor is a ranking problem: Generalisation is not</span><span style="background-color: white; caret-color: rgba(0, 0, 0, 0.9); color: rgba(0, 0, 0, 0.9); font-family: -apple-system, system-ui, BlinkMacSystemFont, "Segoe UI", Roboto, "Helvetica Neue", "Fira Sans", Ubuntu, Oxygen, "Oxygen Sans", Cantarell, "Droid Sans", "Apple Color Emoji", "Segoe UI Emoji", "Segoe UI Emoji", "Segoe UI Symbol", "Lucida Grande", Helvetica, Arial, sans-serif; font-size: 14px;"> </span></b></i></div><br style="box-sizing: inherit; caret-color: rgba(0, 0, 0, 0.9); color: rgba(0, 0, 0, 0.9); font-family: -apple-system, system-ui, BlinkMacSystemFont, "Segoe UI", Roboto, "Helvetica Neue", "Fira Sans", Ubuntu, Oxygen, "Oxygen Sans", Cantarell, "Droid Sans", "Apple Color Emoji", "Segoe UI Emoji", "Segoe UI Emoji", "Segoe UI Symbol", "Lucida Grande", Helvetica, Arial, sans-serif; font-size: 14px; line-height: inherit !important;" /><span style="background-color: white; caret-color: rgba(0, 0, 0, 0.9); color: rgba(0, 0, 0, 0.9); font-family: -apple-system, system-ui, BlinkMacSystemFont, "Segoe UI", Roboto, "Helvetica Neue", "Fira Sans", Ubuntu, Oxygen, "Oxygen Sans", Cantarell, "Droid Sans", "Apple Color Emoji", "Segoe UI Emoji", "Segoe UI Emoji", "Segoe UI Symbol", "Lucida Grande", Helvetica, Arial, sans-serif; font-size: 14px;">The holy grail of machine learning in practice is hold-out methods. We want to make sure that we don’t overgeneralise. 
 However, overgeneralisation is often mistakenly treated as synonymous with overfitting. Overfitting has a different connotation: it is about ranking different models, rather than measuring the generalisation ability of a single model. The generalisation gap between training and test sets is not about Occam’s razor.</div>msuzenhttp://www.blogger.com/profile/06434797231632063088noreply@blogger.com0tag:blogger.com,1999:blog-4550553973032503669.post-35982284582851128492022-12-05T10:16:00.001-08:002023-03-05T04:17:50.674-08:00The conditional query fallacy: Applying Bayesian inference from discrete mathematics perspective<h3 style="text-align: left;"><b>Preamble</b></h3><p></p><table cellpadding="0" cellspacing="0" class="tr-caption-container" style="float: right; margin-left: 1em; text-align: right;"><tbody><tr><td style="text-align: center;"><a href="https://upload.wikimedia.org/wikipedia/en/9/9d/The_Tilled_Field.jpg"><img border="0" height="226" src="https://upload.wikimedia.org/wikipedia/en/9/9d/The_Tilled_Field.jpg" width="320" /></a></td></tr><tr><td class="tr-caption" style="text-align: center;">The Tilled Field,<br />Joan Miró<br />(Wikipedia)</td></tr></tbody></table>One of the core concepts in the data sciences is the conditional probability $p(x|y)$: it appears as the logical description of many tasks, such as formulating <i>regression</i>, and as a core concept in <i>Bayesian inference</i>. However, operationally there is no special meaning to <i>conditional</i> or <i>joint</i> probabilities, as their arguments are no more than compositional event statements. This raises a question: <i>Is there any fundamental relationship between Bayesian inference and discrete mathematics that is practically relevant to us as practitioners?</i> After all, both topics are based on discrete statements returning Boolean values. Unfortunately, the answer to this question is a rabbit hole and probably still open research: there are no clearly established connections between the fundamentals of discrete mathematics and Bayesian inference.<p></p><h3 style="text-align: left;"><b>Statement mappings as definition of probability</b></h3><p>A statement is a logical description of an event or a set of events. Let's give a semi-formal description of such statements.</p><p><b>Definition</b>: A mathematical or logical statement is formed with Boolean relationships $\mathscr{R}$ (conjunctions) among a set of events $\mathscr{E}$; a statement $\mathbb{S}$ is thus formed by at least one tuple $\langle \mathscr{R}, \mathscr{E} \rangle$.</p><p>Relationships can be any binary operator, and events can describe anything perceptual, i.e., a discretised existence. This is the core of discrete mathematics, and almost all problems in this domain, from defining functions to graph theory, are formed in this setting.
 A probability is no exception, and its definition naturally follows as a so-called <i>statement mapping</i>.</p><p><b>Definition</b>: A probability $\mathbb{P}$ is a statement mapping, $\mathbb{P}: \mathbb{S} \rightarrow [0,1]$.</p><p>The interpretation of this definition is that a logical statement is always <span style="font-family: courier;">True</span> if its probability is 1 and always <span style="font-family: courier;">False</span> if it is 0. However, building conditionals on top of this is not so clear cut.</p><h3 style="text-align: left;"><b>Conditional Query Fallacy</b></h3><p>A non-commutative statement would imply that reversing the order of statements should not yield the same filtered dataset for Bayesian inference. However, in this sense Bayes' theorem appears to harbour a fallacy for statement mappings used as conditionals.</p><p><b>Definition</b>: The <i>conditional query fallacy</i> is that one cannot update a belief in probability, because reversing the order of statements in conditional probabilities halts the Bayes update, i.e., back-to-back queries result in the same dataset for inference.</p><p>At first glance, this looks as if Bayes' rule does not support the commutative property, with the posterior practically being equal to the likelihood. However, this fallacy turns out to be <i>a notational misdirection</i>. Inference on back-to-back filtered datasets constitutes the conditional fallacy, i.e., <i>when a query language is used to filter data, obtaining A|B and B|A yields the same dataset regardless of filtering order</i>.</p><p>However, in inference with <i>data</i>, the likelihood is, strictly speaking, not a conditional probability and not a filtering operation. It is merely a measure for the update rule: we compute the likelihood by multiplying the values obtained by inserting i.i.d. samples into the conjugate prior, so a distribution is involved. Hence, the likelihood is computationally not really a reversal of a conditional, as in $P(A|B)$ written in reverse as $P(B|A)$.</p>
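<p>The fallacy can be made concrete in a few lines of code; the toy dataset and the named statements below are our own illustrative assumptions, not part of the formal definitions above:</p><pre>
# A toy illustration of the conditional query fallacy: filtering a
# dataset by B then A returns the same rows as filtering by A then B,
# so "reversing" the query alone cannot update anything.
rows = [  # (A: rain, B: traffic jam) boolean observations
    (True, True), (True, False), (False, True),
    (False, True), (True, True), (False, False),
]

a_then_b = [r for r in rows if r[0] and r[1]]
b_then_a = [r for r in rows if r[1] and r[0]]
assert a_then_b == b_then_a  # identical datasets, order is irrelevant

# The conditionals differ only through their denominators, i.e., the
# size of the first filtered set, not through the filtered rows:
n_a = sum(1 for r in rows if r[0])
n_b = sum(1 for r in rows if r[1])
n_ab = len(a_then_b)
print("P(B|A) =", n_ab / n_a)  # 2/3
print("P(A|B) =", n_ab / n_b)  # 2/4
</pre>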
<h3 style="text-align: left;"><b>Outlook</b></h3><p>In computing conditional probabilities for Bayesian inference, our primary assumption is that the conditional probabilities, likelihood and posterior, are not identical. Discrete mathematics only allows Bayesian updates if time evolution is explicitly stated with non-commutative statements for the conditionals.</p><p>Going back to our initial question: there is indeed a deep connection between the fundamentals of discrete mathematics and Bayesian belief updates on events as logical statements. The fallacy sounds like a trivial error in judgement, but (un)fortunately it leads into the philosophical definitions of probability: simultaneous tracking of time and sample space is not explicitly encoded in any of the notations, making the statement-filtering definition of probability a bit shaky.</p><h3 style="text-align: left;"><b>Glossary of concepts</b></h3><p><b>Statement Mapping </b>A given set of mathematical statements mapped into a domain of numbers.</p><p><b>Probability</b> A statement mapping whose domain is $\mathscr{D} = [0,1]$.</p><p><b>Conditional query fallacy</b> Put differently than the definition above: thinking that two conditional probabilities, as reversed statements of each other in Bayesian inference, yield the same dataset regardless of the time-ordering of the queries.</p><h3 style="text-align: left;"><b>Notes and further reading</b></h3><ul style="text-align: left;"><li>The fallacy is computing $P(A|B)=P(B|A)$ because the filtering results in identical datasets. The correction is that one needs different sample sizes for the reversed statement, or to compute the joints and marginals separately on their own filtered datasets: use the size of the first filtered set in computing the probability, not the total.</li><li>Here, the discrete mathematics we refer to appears within the arguments of probabilities. The discussion of discrete parameter estimation is a different topic; Gelman discusses it <a href="https://statmodeling.stat.columbia.edu/2022/09/30/bayesian-inference-for-discrete-parameters-and-bayesian-inference-for-continuous-parameters-are-these-two-completely-different-forms-of-inference/">here</a>.</li><li><a href="https://en.wikipedia.org/wiki/Conjunction_fallacy">Conjunction Fallacy</a></li><li><a href="https://en.wikipedia.org/wiki/Probability_interpretations">Probability Interpretations</a></li><li><a href="http://science-memo.blogspot.com/2022/07/bayesian-conditional-noncommutative.html">Bayesian rabbit holes: Decoding conditional probability with non-commutative algebra</a> M. Süzen (2022)</li><li><a href="http://www.stat.columbia.edu/~gelman/research/published/physics.pdf">Holes in Bayesian Statistics</a> Gelman-Yao (2021): a beautifully written article, especially the proposal that <i>context dependence</i> should be used instead of <i>subjective</i>.</li></ul>msuzenhttp://www.blogger.com/profile/06434797231632063088noreply@blogger.com0tag:blogger.com,1999:blog-4550553973032503669.post-11413262462104096902022-11-15T12:28:00.000-08:002022-11-15T12:28:18.273-08:00Differentiating ensembles and sample spaces: Alignment between statistical mechanics and probability theory<p><b>Preamble</b></p><p>Sample space is the primary concept introduced in probability and statistics books and papers. However, there needs to be more clarity about what constitutes a sample space in general: there is no explicit distinction between the unique event set and the replica sets. The resolution of this ambiguity lies in the concept of <em>an ensemble</em>. The concept was first introduced by the American theoretical physicist and engineer Gibbs in his book <i><a href="https://en.wikipedia.org/wiki/Elementary_Principles_in_Statistical_Mechanics">Elementary principles of statistical mechanics</a></i>. The primary utility of an ensemble is as a mathematical construction that differentiates between samples and how they form extended objects.</p><p>In this direction, we provide the basics of constructing ensembles from sample spaces in a pedagogically accessible way, clearing up a possible misconception. This usage of ensemble prevents the overuse of the term <em>sample space</em> for different things.
 We introduce some basic formal definitions.</p><p></p><table cellpadding="0" cellspacing="0" class="tr-caption-container" style="float: right;"><tbody><tr><td style="text-align: center;"><a href="https://upload.wikimedia.org/wikipedia/commons/6/66/Gibbs-Elementary_principles_in_statistical_mechanics.png"><img border="0" height="400" src="https://upload.wikimedia.org/wikipedia/commons/6/66/Gibbs-Elementary_principles_in_statistical_mechanics.png" width="250" /></a></td></tr><tr><td class="tr-caption" style="text-align: center;">Figure: <i>Gibbs's book<br />introduced the concept of<br />ensemble (Wikipedia).</i></td></tr></tbody></table><p></p><p><b>What did Gibbs have in mind in constructing statistical ensembles?</b></p><div>A statistical ensemble is a mathematical tool that connects statistical mechanics to thermodynamics. The concept lies in defining microscopic states in molecular dynamics; in statistics and probability, these correspond to sets of events. Though these events differ at the microscopic level, they are sampled from a single thermodynamic ensemble, a representative of varying material properties or, in general, a set of independent random variables. In dynamics, micro-states sample an ensemble. This simple idea helped Gibbs build a mathematical formalism of statistical mechanics companion to Boltzmann's theories.</div><p><b>Differentiating sample space and ensemble in general</b></p><p>The primary confusion in probability theory about what constitutes a sample space is that no distinction is drawn between primitive events and events composed of primitive events: we call both sets the sample space. This terminology is easily overlooked, since in solving practical problems we concentrate on the event set rather than the primitive event set.</p><p><b>Definition: </b><i>A primitive event</i> $e$ is a logically distinct unit of experimental realisation that is not composed of any other events.</p><p><b>Definition</b>: <i>A sample space</i> $\mathscr{S}$ is the set formed by all $N$ distinct primitive events $e_{i}$.</p><p>By this definition, regardless of how many fair coins are used, or whether a coin is tossed in a sequence, the sample space is always $\{H,T\}$, because these are the most primitive distinct events the system can have, i.e., the outcomes of a single coin. The statistical ensemble, however, can be different: for two fair coins, or a coin tossed in a sequence of length two, the corresponding ensemble of system size two reads $\{HH, TT, HT, TH\}$. The definition of an ensemble follows.</p><p><b>Definition</b>: <i>An ensemble</i> $\mathscr{E}$ is a set of ordered sets of primitive events $e_{i}$. These event sets can be sampled with replacement, but order matters, i.e., $\{e_{i}, e_{j}\} \ne \{e_{j}, e_{i}\}$, $i \ne j$.</p><p>Our two-coin example's ensemble should formally be written as $\mathscr{E}=\{\{H,H\}, \{T,T\}, \{H,T\}, \{T,H\}\}$; as order matters, the members $HT$ and $TH$ are distinct. Obviously, for a single toss the ensemble and the sample space coincide.</p>
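<p>A minimal sketch of the distinction (a toy illustration using only the Python standard library; the observable and the trajectory length are arbitrary choices of ours), which also previews the time and ensemble averaging discussed next:</p><pre>
# Contrast the sample space {H, T} of a single coin with the ensemble
# of a two-coin system, then compare an ensemble average with a
# resampled time average for a toy observable.
import itertools
import random

sample_space = ("H", "T")  # primitive events of a single coin
ensemble = list(itertools.product(sample_space, repeat=2))
print(ensemble)  # [('H','H'), ('H','T'), ('T','H'), ('T','T')]

def n_heads(member):
    """Observable: number of heads in an ensemble member."""
    return sum(1 for outcome in member if outcome == "H")

# Ensemble average: mean of the observable over the ensemble set.
ens_avg = sum(n_heads(m) for m in ensemble) / len(ensemble)

# Time average: resample members with replacement from the ensemble,
# mimicking a long trajectory; its value fluctuates run to run.
random.seed(7)
trajectory = random.choices(ensemble, k=10_000)
time_avg = sum(n_heads(m) for m in trajectory) / len(trajectory)
print(ens_avg, time_avg)  # 1.0 and approximately 1.0
</pre>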
<p><b>Ergodicity makes the need for differentiation much clearer: Time and ensemble averaging</b></p><p>The above distinction makes building time and ensemble averages much easier. The term ensemble averaging is obvious: we know the ensemble set, and we average a given observable over this set. Time averaging can then be achieved by curating a much larger set by resampling with replacement from the ensemble. Note that the resulting time-average value would not be unique, as one can generate many different sample sets from the ensemble. Bear in mind, too, that the definition of how to measure convergence to the ergodic regime is not unique.</p><p><b>Conclusion</b></p><p>Even though the distinction we made sounds very obscure, this alignment between statistical mechanics and probability theory may clarify the conception of ergodic regimes for general practitioners.</p><p><b>Further reading</b></p><p></p><ul style="text-align: left;"><li><a href="https://science-memo.blogspot.com/2020/01/a-practical-understanding-of-ergodicity.html">Practical understanding of ergodicity</a>.</li><li><a href="https://science-memo.blogspot.com/2022/05/ergodic-regime-not-process.html">A misconception in ergodicity: Identify ergodic regime not ergodic process</a></li></ul><p></p>msuzenhttp://www.blogger.com/profile/06434797231632063088noreply@blogger.comNew Haven, CT, USAtag:blogger.com,1999:blog-4550553973032503669.post-91197832189607504182022-10-25T15:22:00.001-07:002022-10-25T15:27:57.173-07:00 Overfitting is about complexity ranking of inductive biases : Algorithmic recipe<div><b>Preamble</b></div><div><br /></div><div><table cellpadding="0" cellspacing="0" class="tr-caption-container" style="float: right;"><tbody><tr><td style="text-align: center;"><a href="https://upload.wikimedia.org/wikipedia/commons/9/90/Man_In_The_Moon2.png"><img border="0" height="400" src="https://upload.wikimedia.org/wikipedia/commons/9/90/Man_In_The_Moon2.png" width="160" /></a></td></tr><tr><td class="tr-caption" style="text-align: center;">Figure: Moon patterns the<br />human brain invents.<br />(Wikipedia)</td></tr></tbody></table>Detecting overfitting is inherently a comparison problem over the complexity of multiple objects, i.e., models, or algorithms capable of making predictions. A model is overfitted (<i>underfitted</i>) only in comparison to another model. Model selection involves comparing multiple models of different complexities.
 A summary of this approach, with basic mathematical definitions, is given here.</div><div><br /></div><div><b>Misconceptions: <i>Poor generalisation is not synonymous with overfitting.</i></b></div><div><br /></div><div>None of the following techniques prevent overfitting: cross-validation, having more data, early stopping, and comparing test-train learning curves are all about generalisation. Their purpose is <u>not</u> to detect overfitting.</div><div><br /></div><div>We need at least two different models, i.e., two different inductive biases, to judge which model is overfitted. One distinct approach in deep learning, called dropout, prevents overfitting while alternating between multiple models, i.e., multiple inductive biases. For this judgment, a dropout implementation has to compare the test performances of those alternating models during training.
</span></div><div><span color="rgba(0, 0, 0, 0.9)" face="-apple-system, system-ui, BlinkMacSystemFont, "Segoe UI", Roboto, "Helvetica Neue", "Fira Sans", Ubuntu, Oxygen, "Oxygen Sans", Cantarell, "Droid Sans", "Apple Color Emoji", "Segoe UI Emoji", "Segoe UI Emoji", "Segoe UI Symbol", "Lucida Grande", Helvetica, Arial, sans-serif" style="background-color: white; caret-color: rgba(0, 0, 0, 0.9);"><br /></span></div><div><span color="rgba(0, 0, 0, 0.9)" face="-apple-system, system-ui, BlinkMacSystemFont, "Segoe UI", Roboto, "Helvetica Neue", "Fira Sans", Ubuntu, Oxygen, "Oxygen Sans", Cantarell, "Droid Sans", "Apple Color Emoji", "Segoe UI Emoji", "Segoe UI Emoji", "Segoe UI Symbol", "Lucida Grande", Helvetica, Arial, sans-serif" style="background-color: white;"><div style="caret-color: rgb(0, 0, 0);"><span color="rgba(0, 0, 0, 0.9)" face="-apple-system, system-ui, BlinkMacSystemFont, "Segoe UI", Roboto, "Helvetica Neue", "Fira Sans", Ubuntu, Oxygen, "Oxygen Sans", Cantarell, "Droid Sans", "Apple Color Emoji", "Segoe UI Emoji", "Segoe UI Emoji", "Segoe UI Symbol", "Lucida Grande", Helvetica, Arial, sans-serif" style="caret-color: rgba(0, 0, 0, 0.9);"><b>What is an inductive bias? </b></span></div><div style="caret-color: rgb(0, 0, 0);"><span color="rgba(0, 0, 0, 0.9)" face="-apple-system, system-ui, BlinkMacSystemFont, "Segoe UI", Roboto, "Helvetica Neue", "Fira Sans", Ubuntu, Oxygen, "Oxygen Sans", Cantarell, "Droid Sans", "Apple Color Emoji", "Segoe UI Emoji", "Segoe UI Emoji", "Segoe UI Symbol", "Lucida Grande", Helvetica, Arial, sans-serif" style="caret-color: rgba(0, 0, 0, 0.9);"><br /></span></div><div><span style="caret-color: rgba(0, 0, 0, 0.9);">There are multiple inceptions of inductive bias. Here, we concentrate on a parametrised model, $\mathscr{M}(\theta)$ on a dataset $\mathscr{D}$, the selection of a model type, or modelling approach, usually manifest as a functional form $\mathscr{M}=f(x)$ or as a function approximation, i.e., for example neural network, are all manifestation of inductive biases. 
 Different parameterisations of a model learned on subsets of the dataset are still the same inductive bias.</div><div><br /></div><div><b>Complexity ranking of inductive biases: An algorithmic recipe</b></div><div><b><br /></b></div><div>We sketch an algorithmic recipe for complexity ranking of inductive biases via informal steps:</div><div><ol style="text-align: left;"><li>Define a complexity measure $\mathscr{C}(\mathscr{M})$ over an inductive bias.</li><li>Define a generalisation measure $\mathscr{G}(\mathscr{M}, \mathscr{D})$ over an inductive bias and a dataset.</li><li>Select a set of inductive biases, at least two, $\mathscr{M}_{1}$ and $\mathscr{M}_{2}$.</li><li>Produce complexity and generalisation measures on ($\mathscr{M}$, $\mathscr{D}$); here for two inductive biases: $\mathscr{C}_{1}$, $\mathscr{C}_{2}$, $\mathscr{G}_{1}$, $\mathscr{G}_{2}$.</li><li>Rank $\mathscr{M}_{1}$ and $\mathscr{M}_{2}$: $\arg\max \{ \mathscr{G}_{1}, \mathscr{G}_{2}\}$ and $\arg\min \{ \mathscr{C}_{1}, \mathscr{C}_{2}\}$.</li></ol><div>The core concept is that when generalisations are close enough, we pick the inductive bias that is less complex (see the sketch below).</div></div>
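<div><br /></div><div>A minimal sketch of the recipe, under illustrative assumptions of ours: the polynomial degree stands in for the complexity measure $\mathscr{C}$, a held-out $R^{2}$ score stands in for the generalisation measure $\mathscr{G}$, and the closeness tolerance is a free choice:</div><div><br /></div><pre>
# Steps 1-5 of the recipe on toy data (numpy assumed): define C and G,
# pick two inductive biases, measure, then rank by G and break near-ties
# with the lower complexity.
import numpy as np

rng = np.random.default_rng(1)
x = np.linspace(-1, 1, 60)
y = 1.5 * x - 0.5 * x ** 2 + rng.normal(0, 0.2, x.size)
x_tr, y_tr, x_te, y_te = x[::2], y[::2], x[1::2], y[1::2]

def generalisation(degree):
    """G(M, D): R^2 of a degree-`degree` polynomial on the test split;
    the degree itself plays the role of the complexity measure C(M)."""
    coeffs = np.polyfit(x_tr, y_tr, degree)
    resid = y_te - np.polyval(coeffs, x_te)
    return 1.0 - np.sum(resid ** 2) / np.sum((y_te - y_te.mean()) ** 2)

biases = (2, 8)                                    # step 3
scores = {d: generalisation(d) for d in biases}    # step 4
# Step 5: among biases whose generalisations are close enough
# (tolerance is a free choice here), keep the least complex one;
# the rejected one is declared overfitted.
tol = 0.01
best = max(scores, key=scores.get)
close = [d for d in biases if tol >= abs(scores[d] - scores[best])]
print("selected inductive bias: degree", min(close))
</pre>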
</span></div><div><span style="background-color: white;"><br /></span></div><div><span style="background-color: white;">In fact, due to resource constraints of model life-cycle, i.e., energy consumption and cognitive load of introducing a complex model, practicing proper Occam's razor: complexity ranking of inductive biases, is much more important than ever for sustainable environment and human capital.</span></div><div><span style="background-color: white;"><br /></span></div><div><span style="background-color: white;"><b>Further reading</b></span></div><div><span style="background-color: white;"><br style="box-sizing: inherit; line-height: inherit;" /></span><span color="rgba(0, 0, 0, 0.9)" face="-apple-system, system-ui, BlinkMacSystemFont, "Segoe UI", Roboto, "Helvetica Neue", "Fira Sans", Ubuntu, Oxygen, "Oxygen Sans", Cantarell, "Droid Sans", "Apple Color Emoji", "Segoe UI Emoji", "Segoe UI Emoji", "Segoe UI Symbol", "Lucida Grande", Helvetica, Arial, sans-serif" style="background-color: white; caret-color: rgba(0, 0, 0, 0.9);">Some of the posts, reverse chronological order, that this blog have tried to convey what overfitting entails and its general implications. </span></div><div><ul style="text-align: left;"><li><span color="rgba(0, 0, 0, 0.9)" face="-apple-system, system-ui, BlinkMacSystemFont, "Segoe UI", Roboto, "Helvetica Neue", "Fira Sans", Ubuntu, Oxygen, "Oxygen Sans", Cantarell, "Droid Sans", "Apple Color Emoji", "Segoe UI Emoji", "Segoe UI Emoji", "Segoe UI Symbol", "Lucida Grande", Helvetica, Arial, sans-serif" style="background-color: white; caret-color: rgba(0, 0, 0, 0.9);"><a href="http://science-memo.blogspot.com/2022/06/empirical-risk-minimisation-learning-curve.html">Empirical risk minimization is not learning :</a></span><span color="rgba(0, 0, 0, 0.9)" face="-apple-system, system-ui, BlinkMacSystemFont, "Segoe UI", Roboto, "Helvetica Neue", "Fira Sans", Ubuntu, Oxygen, "Oxygen Sans", Cantarell, "Droid Sans", "Apple Color Emoji", "Segoe UI Emoji", "Segoe UI Emoji", "Segoe UI Symbol", "Lucida Grande", Helvetica, Arial, sans-serif" style="background-color: white; caret-color: rgba(0, 0, 0, 0.9);"><a href="http://science-memo.blogspot.com/2022/06/empirical-risk-minimisation-learning-curve.html">A mathematical definition of learning and re-understanding of overfitting and Occam's razor in machine learning</a></span></li><li><span style="background-color: white; caret-color: rgba(0, 0, 0, 0.9);"><a href="https://science-memo.blogspot.com/2021/03/critical-look-on-why-deployed-machine.html">Critical look on why deployed machine learning model performance degrade quickly.</a></span></li><li><span style="background-color: white; caret-color: rgba(0, 0, 0, 0.9);"><a href="http://memosisland.blogspot.com/2019/12/bringing-back-occams-razor-to-modern.html">Bringing back Occam's razor to modern connectionist machine learning.</a></span></li><li><span style="background-color: white; caret-color: rgba(0, 0, 0, 0.9);"><a href="https://www.kdnuggets.com/2017/08/understanding-overfitting-meme-supervised-learning.html">Understanding overfitting: an inaccurate meme in Machine Learning</a></span></li></ul></div><div><a data-attribute-index="21" href="https://lnkd.in/enzzZKWa" style="background: var(--artdeco-reset-base-background-transparent); border: var(--artdeco-reset-link-border-zero); box-sizing: inherit; color: var(--color-text-link-visited); font-family: -apple-system, system-ui, BlinkMacSystemFont, "Segoe UI", Roboto, "Helvetica Neue", "Fira Sans", Ubuntu, Oxygen, "Oxygen 
Sans", Cantarell, "Droid Sans", "Apple Color Emoji", "Segoe UI Emoji", "Segoe UI Emoji", "Segoe UI Symbol", "Lucida Grande", Helvetica, Arial, sans-serif; line-height: inherit; margin: var(--artdeco-reset-base-margin-zero); padding: var(--artdeco-reset-base-padding-zero); position: relative; text-decoration: var(--artdeco-reset-link-text-decoration-none); touch-action: manipulation; vertical-align: var(--artdeco-reset-base-vertical-align-baseline);"><br /></a><br /></div></div>msuzenhttp://www.blogger.com/profile/06434797231632063088noreply@blogger.com0tag:blogger.com,1999:blog-4550553973032503669.post-19056848319730026552022-10-04T12:10:00.005-07:002024-02-10T14:30:31.657-08:00Heavy-matter-wave and ultra-sensitive interferometry: An opportunity for quantum-gravity becoming an evidence based research<div><span style="font-family: arial;"><b><table cellpadding="0" cellspacing="0" class="tr-caption-container" style="float: right;"><tbody><tr><td style="text-align: center;"><a href="https://upload.wikimedia.org/wikipedia/commons/3/37/1919_eclipse_positive.jpg" style="clear: right; margin-bottom: 1em; margin-left: auto; margin-right: auto;"><img border="0" data-original-height="800" data-original-width="623" height="320" src="https://upload.wikimedia.org/wikipedia/commons/3/37/1919_eclipse_positive.jpg" width="249" /></a></td></tr><tr><td class="tr-caption" style="text-align: center;"><span> Solar Eclipse of 1919 <br />(wikipedia)</span></td></tr></tbody></table><br />Preamble</b> </span><br /></div><div><span style="font-family: arial;"><br /></span></div><div><table cellpadding="0" cellspacing="0" class="tr-caption-container" style="float: right; margin-left: 1em; text-align: right;"><tbody><tr><td class="tr-caption" style="text-align: center;"><div><span><span style="font-family: arial;"> </span></span></div></td></tr></tbody></table><span style="font-family: arial;">Cool ideas in theoretical physics are ofter opaque for general reader whether if they are backed up with any experimental evidence in the real world. The success of <a href="https://en.wikipedia.org/wiki/LIGO">LIGO (Laser Interferometer Gravitational-wave Observatory) </a>definitely proven the value of interferometry for advancement of cool ideas of theoretical physics supported by real world measurable evidence. An other type of interferometry that could be used in testing multiple-different ideas from theoretical physics is called matter-wave interferometry or atom interferometry: It's been around decades but the new developments and increased sensitivity with measurement on heavy atomic system-waves will pave the technical capabilities to test multiple ideas of theoretical physics. </span></div><div><span style="font-family: arial;"><br /></span><b><span style="font-family: arial;">Basic mathematical principle of interferometry</span></b></div><div><b><span style="font-family: arial;"><br /></span></b><span style="font-family: arial;">Usually interferometry is explained with device and experimental setting details that could be confusing. However, one could explain the very principle without introducing any experimental setup. The basic idea of of interferometry is that if a simple wave, such as $\omega(t)=\sin\Theta(t)$, is first split into two waves and reflected over the same distance, one with shifted with a constant phase, in the vacuum without any interactions. 
<div><span style="font-family: arial;"><br /></span><b><span style="font-family: arial;">Detection of matter-waves: What is heavy, and what is ultra-sensitivity?</span></b></div><div><b><span style="font-family: arial;"><br /></span></b><span style="font-family: arial;">Every atomic system exhibits quantum wave properties, i.e., matter waves. This implies that a given molecular system has wave signatures and characteristics that can be extracted in an experimental setting. Instead of laser light, one can use an atomic system that is reflected, similar to the basic principle. The primary difference, however, is that increasing mass requires orders of magnitude more sensitive wave detectors for atom interferometers. Currently, heavy usually means above ~$10^{9}$ Da (compared to Helium-4, which is about ~4 Da); such new heavy atom interferometers might be able to detect gravitational interactions at the quantum-wave level, thanks to the ultra-sensitive precision achieved. This sounds trivial, but an experimental connection to theories of quantum gravity, one of the unsolved puzzles of theoretical physics, would be a potential breakthrough. Prominent examples in this direction are entropic gravity and wave-function collapse theories.</span></div>
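<div><span style="font-family: arial;"><br /></span></div><div><span style="font-family: arial;">As a back-of-the-envelope illustration of why heavy matter waves demand ultra-sensitive detection, the de Broglie wavelength $\lambda = h/(mv)$ (standard physics, not from the experimental papers above; the unit velocity and the mass choices are our assumptions) shrinks rapidly with mass:</span></div><pre>
# A back-of-the-envelope sketch: de Broglie wavelength h/(m*v) for a
# light and a "heavy" matter wave (the 1 m/s velocity is an assumption
# for illustration only).
H_PLANCK = 6.62607015e-34   # Planck constant, J*s
DALTON = 1.66053907e-27     # 1 Da in kg

def de_broglie_wavelength(mass_da, velocity=1.0):
    """de Broglie wavelength in metres for a mass in Da at velocity m/s."""
    return H_PLANCK / (mass_da * DALTON * velocity)

print(de_broglie_wavelength(4))      # Helium-4: ~1e-7 m
print(de_broglie_wavelength(1e9))    # a "heavy" 1 GDa system: ~4e-16 m
</pre>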
</span><a href=" https://doi.org/10.1116/5.0080940" style="font-family: arial;">doi</a></li><ul><li><span style="font-family: arial;">Current capabilities as of 2022, atom interferometers can reach up to ~300 kDa.</span></li></ul><li><span style="font-family: arial;">Testing Entropic gravity, <a href="https://arxiv.org/abs/1612.00288">arXiv</a>. </span></li><li><span style="font-family: arial;">NASA early stage ideas workshops : <a href="https://web.archive.org/web/20150310023318/http://www.nasa.gov/content/nasa-early-stage-technology-workshop-astrophysics-heliophysics/">web-archive</a></span></li></ul></div>msuzenhttp://www.blogger.com/profile/06434797231632063088noreply@blogger.com0tag:blogger.com,1999:blog-4550553973032503669.post-17523426706903991312022-09-20T12:36:00.005-07:002022-09-20T22:36:19.776-07:00Building robust AI systems: Is an artificial intelligent agent just a probabilistic boolean function? <p></p><div style="text-align: right;"><span style="font-weight: 700;"><br /></span></div><b>Preamble</b><table cellpadding="0" cellspacing="0" class="tr-caption-container" style="float: right;"><tbody><tr><td style="text-align: center;"><a href="https://upload.wikimedia.org/wikipedia/commons/c/ce/George_Boole_color.jpg" style="clear: right; margin-bottom: 1em; margin-left: auto; margin-right: auto;"><img border="0" data-original-height="800" data-original-width="600" height="320" src="https://upload.wikimedia.org/wikipedia/commons/c/ce/George_Boole_color.jpg" width="240" /></a></td></tr><tr><td class="tr-caption" style="text-align: center;"><span> George Boole (Wikipedia)</span></td></tr></tbody></table><p></p><p>Agent, AI agent or an intelligent agent is used often to describe algorithms or AI systems that are released by research teams recently. However, the definition of an intelligent agent (IA) is a bit opaque. Naïvely thinking, it is nothing more than a decision maker that shows some intelligent behaviour. However, <i>making a decision intelligently </i>is hard to quantify computationally, and probably IA for us is something that can be representable as a Turing machine. Here, we argue that an intelligent agent in the current AI systems should be seen as a function without side effects outputting a boolean output and shouldn't be extrapolated or compare to human level intelligence. Causal inference capabilities should be seen as a scientific guidance to this function decompositions without side-effects, i.e., Human in-the loop Probabilistic Boolean Functions (PBFs).</p><p><b>Computational learning theories are based on binary learners</b></p><p>Two of the major theories of statistical learning PAC and VC dimensions build upon on "binary learning". </p><p>PAC stands for Probably Approximately Correct, It sets basic framework and mathematical building blocks for defining a machine learning problem from complexity theory. Probably correct implies finding a weak learning function given binary instance set $X=\{1,0\}^{n}$. The binary set or its subsets mathematically called concepts and under certain mathematical conditions a system said to be PAC learnable. There are equivalences to VC and other computation learning frameworks. </p><p><b>Robust AI systems: Deep reinforcement learning and PAC</b></p><p>Even though the theory of learning on deep (reinforcement) learning is not established and active area of research. 
<p><b>Robust AI systems: Deep reinforcement learning and PAC</b></p><p>Even though a theory of learning for deep (reinforcement) learning is not established and remains an active area of research, there is an intimate connection with the composition of <i>concepts, i.e., binary instance subsets</i>, as almost all operations within deep RL can be viewed as probabilistic Boolean functions (PBFs).</p><p><b>Conclusion</b></p><p>Current research and practice in robust AI systems could focus on producing learnable probabilistic Boolean functions (PBFs) as intelligent agents, rather than on human-level intelligent agents. This modest purpose might bear more practical fruit than the long-term aim of replacing human intelligence. Moreover, the theory of computation for deep learning and causality could benefit from this approach.</p><p><b>Further reading</b></p><ul style="text-align: left;"><li><a href="https://web.mit.edu/6.435/www/Valiant84.pdf">Valiant84</a>. Theory of the Learnable.</li><li><a href="https://en.wikipedia.org/wiki/Vapnik–Chervonenkis_dimension">VC Dimension</a>.</li><li>Modern Theory and Machine Learning, Chase-Freitag, 2018</li></ul><p><br /></p>msuzenhttp://www.blogger.com/profile/06434797231632063088noreply@blogger.com0tag:blogger.com,1999:blog-4550553973032503669.post-60680662603990722872022-07-05T11:32:00.000-07:002024-02-29T10:53:40.743-08:00Bayesian rabbit holes: Decoding conditional probability with non-commutative algebra<p><b>Preamble</b></p><p></p><table cellpadding="0" cellspacing="0" class="tr-caption-container" style="float: right;"><tbody><tr><td style="text-align: center;"><a href="https://upload.wikimedia.org/wikipedia/commons/d/da/Alice_par_John_Tenniel_02.png"><img border="0" height="320" src="https://upload.wikimedia.org/wikipedia/commons/d/da/Alice_par_John_Tenniel_02.png" width="209" /></a></td></tr><tr><td class="tr-caption" style="text-align: center;">The White Rabbit<br />(Wikipedia)</td></tr></tbody></table><div style="text-align: left;">A novice analyst, or even an experienced (data) scientist, might think that the bar notation $|$ representing conditional probability carries some distinct operational mathematics, especially when written with explicit distribution functions, $p(x|y)$. A similar thought applies to joint probabilities such as $p(x, y)$; one can also see mixtures of these, such as $p(x, y | z)$. In this short exposition, we clarify that none of these <em>identifications</em> within the arguments of a probability has any different <em>resulting</em> operational meaning.</div>
</span></div><p></p><p><span style="color: #0e101a;"><b>Arguments in probabilities: </b></span><b style="caret-color: rgb(14, 16, 26); color: #0e101a;">Boolean statement and </b><b><span style="color: #0e101a;"><span style="caret-color: rgb(14, 16, 26);">filtering</span> </span></b></p><p><span data-preserver-spaces="true" style="color: #0e101a; margin-bottom: 0pt; margin-top: 0pt;">Arguments in any probability are </span><em style="color: #0e101a; margin-bottom: 0pt; margin-top: 0pt;">mathematical statements </em><span data-preserver-spaces="true" style="color: #0e101a; margin-bottom: 0pt; margin-top: 0pt;">of discrete mathematics that correspond to </span><em style="color: #0e101a; margin-bottom: 0pt; margin-top: 0pt;">events</em><span data-preserver-spaces="true" style="color: #0e101a; margin-bottom: 0pt; margin-top: 0pt;"> in the experimental setting. These are statements declaring some facts with a boolean outcome. These statements are queries to a data set. Such as, if the temperature is above $30$ degrees, $T > 30$. Temperature $T$ is a random variable. Unfortunately, the term random variable is often used differently in many textbooks. It is defined as a mapping rather than as a single variable. The bar $|$ in conditional probability $p(x|y)$, implies statement $x$ given that statement $y$ has already occurred, i.e., if. This interpretation implies that $y$ first occurred before $x$, but it doesn't imply that they are causally linked. The condition plays a role in filtering, a </span><em style="color: #0e101a; margin-bottom: 0pt; margin-top: 0pt;">where</em><span data-preserver-spaces="true" style="color: #0e101a; margin-bottom: 0pt; margin-top: 0pt;"> clause in query languages. $p(x|y)$ boils down to $p_{y}(x)$, where the first statement $y$ is applied to the dataset before computing the probability on the remaining statement $x$.</span></p><p>In the case of joint probabilities $p(x, y)$, events co-occur, i.e., AND statement. In summary, anything in the argument of $p$ is written as a mathematical statement. In the case of assigning a distribution or a functional form to $p$, there is no particular role for conditionals or joints; the modelling approach sets an appropriate structure.</p><p><span style="color: #0e101a;"><b>Conditioning does not imply casual direction: do-Calculus do</b></span></p><p><span style="color: #0e101a;">A filtering interpretation of conditional $p(x|y)$ does not imply causal direction, but $do$ operator does, $p(x|do(y))$. </span></p><p><b><span style="color: #0e101a;">Non-commutative algebra: When frequentist are equivalent to</span><span style="color: #0e101a;"> Bayesian</span></b></p><p><span style="color: #0e101a;">Most of the simple filtering operations would result in identical results if reversed. $p(x|y) = p(y|x)$, prior being equal to posterior. This remark implies we can't apply Bayesian learning with commutative statements. We need non-commutative statements; as a result, one can do Bayesian learning with the newly arriving data, i.e., the arrival of new subjective evidence. The reason seems to be due to the frequentist nature of filtering.</span></p><p><span style="color: #0e101a;"><b>Outlook</b> </span></p><p><span style="color: #0e101a;">Even though we provided some revelations on decoding the operational meaning of conditional probabilities, we suggested that any conditional, joint or any combination of these within the argument of probabilities has no operational purpose other than pre-processing steps. 
<p><b>Outlook</b></p><p>Even though we have provided some revelations on decoding the operational meaning of conditional probabilities, we suggest that any conditional, joint, or combination of these within the argument of a probability has no operational purpose beyond pre-processing steps. However, the philosophical and practical implications of probabilistic reasoning are always counterintuitive, and probabilistic reasoning is a computationally hard problem. From a causal inference perspective, we are better equipped to tackle these issues with do-Bayesian analysis.</p><p><b>Further reading</b></p><ul style="text-align: left;"><li><a href="http://discrete.openmathbooks.org/dmoi3.html">Discrete Mathematics</a>, Oscar Levin</li><li><a href="https://www.wiley.com/en-us/Causal+Inference+in+Statistics%3A+A+Primer-p-9781119186847">Causal Inference in Statistics, A Primer</a>, Judea Pearl, Madelyn Glymour, Nicholas P. Jewell</li><li><a href="https://plato.stanford.edu/entries/conditionals/">Indicative Conditionals</a>, Stanford Encyclopedia of Philosophy</li><li><a href="https://plato.stanford.edu/entries/epistemology-bayesian/">Bayesian Epistemology</a>, Stanford Encyclopedia of Philosophy</li></ul><div><div><span style="font-size: x-small;">Please cite as:</span></div><div><br /></div><div><span style="font-size: x-small;"> @misc{suezen22brh, </span></div><div><span style="font-size: x-small;"> title = {Bayesian rabbit holes: Decoding conditional probability with non-commutative algebra}, </span></div><div><span style="font-size: x-small;"> howpublished = {\url{https://science-memo.blogspot.com/2022/07/bayesian-conditional-noncommutative.html}}, </span></div><div><span style="font-size: x-small;"> author = {Mehmet Süzen},</span></div><div><span style="font-size: x-small;"> year = {2022}</span></div><div><span style="font-size: x-small;">}</span></div></div><div><br /></div><p></p>msuzenhttp://www.blogger.com/profile/06434797231632063088noreply@blogger.com0tag:blogger.com,1999:blog-4550553973032503669.post-45221655395937195302022-06-20T10:01:00.005-07:002023-09-21T12:30:04.277-07:00 Empirical risk minimization is not learning : A mathematical definition of learning and re-understanding of overfitting and Occam's razor in machine learning<p style="text-align: left;"><b></b></p>
cellpadding="0" cellspacing="0" class="tr-caption-container" style="float: right;"><tbody><tr><td style="text-align: center;"><a href="https://upload.wikimedia.org/wikipedia/commons/3/33/Nelder-Mead_Simionescu.gif" style="clear: right; margin-bottom: 1em; margin-left: auto; margin-right: auto;"><img border="0" data-original-height="800" data-original-width="800" height="320" src="https://upload.wikimedia.org/wikipedia/commons/3/33/Nelder-Mead_Simionescu.gif" width="320" /></a></td></tr><tr><td class="tr-caption" style="text-align: center;"><span> </span>Simionescu Function (Wikipedia)</td></tr></tbody></table><b><br />Preamble</b><p></p><p></p><div style="text-align: left;">The holy grail of machine learning appears to be the <i><a href="https://en.wikipedia.org/wiki/Empirical_risk_minimization">empirical risk minimisation</a></i>. However, on the contrary to general dogma, the primary objective of machine learning is not <i>risk minimisation per se </i>but mimicking human or <a href="https://www.cs.rhul.ac.uk/~chrisw/">animal learning</a>. Empirical risk minimisation is just a snap-shot in this direction and is part of a learning measure, not the primary objective.</div><div style="text-align: left;"><br /></div><div style="text-align: left;">Unfortunately, all current major machine learning libraries are implementing empirical risk minimisation as primary objective, so called a training, manifest as usually <span style="font-family: courier;">.fit. </span>Here we provide a mathematical definition of learning in the language of empirical risk minimisation and its implications on two very important concepts, overfitting and Occam's razor.</div><div style="text-align: left;"><br /></div><div style="text-align: left;">Our exposition is still informal but it should be readable for experienced practitioners.</div><p></p><p style="text-align: left;"><b>Definition: Empirical Risk Minimization</b></p><p style="text-align: left;">Given set of $k$ observation $\mathscr{O} = \{o_{1}, ..., o_{k} \}$ where $o_{i} \in \mathbb{R}^{n}$, $n$-dimensional vectors. Corresponding labels or binary classes, the set $\mathscr{S} = \{ s_{1}, .., s_{k}\}$, with $s_{i} \in \{0,1\}$ is defined. A function $g$ maps observations to classes $g: \mathscr{O} \to \mathscr{S}$. An error function (or loss) $E$ measures the error made by the estimated map function $\hat{g}$ compare to true map function $g$, $E=E(\hat{g}, g)$. The entire idea of supervised machine learning boils down to minimising a functional called ER (Empirical Risk), here we denoted by $G$, it is a functional, meaning is a function of function, over the domain $\mathscr{D} = Tr(\mathscr{O} x \mathscr{S})$ in discrete form, $$ G[E] = \frac{1}{k} {\Large \Sigma}_{\mathscr{D} } E(\hat{g}, g) $$. This is so called a training a machine learning model, or an estimation for $\hat{g}$. 
<p style="text-align: left;"><b>Definition: Learning measure</b></p><p style="text-align: left;">A learning measure $M$ on $\hat{g}$ is defined over a set of $l$ observation sets of increasing size, $\Theta = \{ \mathscr{O}_{1}, ..., \mathscr{O}_{l}\}$, whereby the size of each set is monotonically higher, meaning that $ | \mathscr{O}_{1}| < | \mathscr{O}_{2}| < ... < | \mathscr{O}_{l}|$.</p><p style="text-align: left;"><b>Definition: Empirical Risk Minimization with a learning measure (ERL)</b></p><p style="text-align: left;">Now we are in a position to reformulate ER with a learning measure; we call this ERL. It comes with a testing procedure.</p><p style="text-align: left;">If the empirical risks $G[E_{j}]$ decrease monotonically, $ G[E_{1}] > G[E_{2}] > ... > G[E_{l}]$, then we say the functional form of $\hat{g}$ is learning over the set $\Theta$.</p><p style="text-align: left;"><b>Functional form of $\hat{g}$: Inductive bias</b></p><p style="text-align: left;">The functional form implies a model selection; the technical term for this, together with other assumptions, is <a href="https://en.wikipedia.org/wiki/Inductive_bias">inductive bias</a>, meaning the selection of the complexity of the model, for example a linear or a nonlinear regression.</p><p style="text-align: left;"><b>Re-understanding of overfitting and Occam's razor from the ERL perspective</b></p><p style="text-align: left;">Say we have two different ERLs, on $\hat{g}^{1}$ and $\hat{g}^{2}$. Overfitting is then a comparison problem between their monotonically decreasing empirical risks: among models, here inductive biases or functional forms, over the learning measure, we select the one with "higher monotonicity" and lower complexity, and call the other the overfitted model. Complexity here boils down to the functional complexity of $\hat{g}^{1}$ and $\hat{g}^{2}$, and overfitting can only be tested with two models over the monotonicity of their ERLs.</p>
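<p style="text-align: left;">A minimal sketch of the ERL test (the risk curves below are invented toy numbers, not measurements):</p><pre>
# Given empirical risks measured over growing observation sets for two
# inductive biases, check the monotonic-decrease criterion of ERL.
risks = {
    "g1 (simple)":  [0.40, 0.31, 0.24, 0.18],  # risks over |O_1| ... |O_l|
    "g2 (complex)": [0.35, 0.36, 0.30, 0.33],
}

def is_learning(risk_curve):
    """ERL criterion: G[E_1] > G[E_2] > ... > G[E_l]."""
    return all(a > b for a, b in zip(risk_curve, risk_curve[1:]))

for name, curve in risks.items():
    print(name, "is learning:", is_learning(curve))
# g1 satisfies the criterion and g2 does not: the comparison between
# the two curves, not either curve alone, is the overfitting test.
</pre>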
<a href="http://bayes.cs.ucla.edu/jp_home.html">A next level would be to add </a></span><a href="http://bayes.cs.ucla.edu/jp_home.html">causality in the definition.</a></p><div><span style="font-family: inherit;">Please cite as follows:</span></div><div><span style="font-family: inherit;"><u><br /></u></span></div><div><span style="font-family: inherit;"> @misc{suezen22erm, </span></div><div><span style="font-family: inherit;"> title = { Empirical risk minimization is not learning : A mathematical definition of learning and re-understanding of overfitting and Occam's razor in machine learning}, </span></div><div><span style="font-family: inherit;"> howpublished = {\url{</span>http://science-memo.blogspot.com/2022/06/empirical-risk-minimisation-learning-curve.html<span style="font-family: inherit;">}}, </span></div><div><span style="font-family: inherit;"> author = {Mehmet Süzen},</span></div><div><span style="font-family: inherit;"> year = {2022}</span></div><p><span style="font-family: inherit;"></span></p><div><span style="font-family: inherit;">}</span> </div><p style="text-align: left;"><span style="font-family: inherit;"><b>Postscript Notes</b></span></p><p style="text-align: left;">Following notes are added after initial release </p><p style="text-align: left;"><span style="font-family: inherit;"><b>Postscript 1: Understanding overfitting as comparison of inductive biases</b></span></p>
<p style="font-stretch: normal; line-height: normal; margin: 0px;"><span style="font-family: inherit;">ERM could be confusing for even experienced researchers. It is indeed about risk measure. </span><span style="font-family: inherit;">We measure the risk of a model, i.e., machine learning procedure that how much error would </span><span style="font-family: inherit;">it make on the</span><span style="font-family: inherit;"> </span><span style="font-family: inherit;">given new data distribution, as in risk of investing. This is quite a similar</span><span style="font-family: inherit;"> </span><span style="font-family: inherit;">notion as in financial risk of loss but not explicitly stated.</span><span style="font-family: inherit;"> </span></p>
<p style="font-stretch: normal; line-height: normal; margin: 0px; min-height: 15px;"><span style="font-family: inherit;"><br /></span></p>
<p style="font-stretch: normal; line-height: normal; margin: 0px;"><span style="font-family: inherit;">Moreover, a primary objective of machine learning is not ERM but measure learning curves </span><span style="font-family: inherit;">and pair-wise comparison of</span><span style="font-family: inherit;"> </span><span style="font-family: inherit;">inductive biases, avoiding overfitting.</span><span style="font-family: inherit;"> </span><span style="font-family: inherit;">An inductive bias,</span><span style="font-family: inherit;"> </span><span style="font-family: inherit;">here we restrict the concept as in model</span><span style="font-family: inherit;"> </span><span style="font-family: inherit;">type,</span><span style="font-family: inherit;"> </span><span style="font-family: inherit;">is a model selection step: different </span><span style="font-family: inherit;"> </span><span style="font-family: inherit;">parametrisation of the same model are still the same inductive bias.</span><span style="font-family: inherit;"> </span><span style="font-family: inherit;">That’s why</span><span style="font-family: inherit;"> </span><span style="font-family: inherit;">standard training-error learning curves can’t be used to detect overfitting alone. </span></p><p style="text-align: left;"><span style="font-family: inherit;"><b>Postscript 2: Learning is not to optimise: Thermodynamic limit, true risk and accessible learning space</b></span></p>
<p style="font-stretch: normal; line-height: normal; margin: 0px;"><span style="font-family: inherit;">True risk minimisation in machine learning is not possible, instead we </span><span style="font-family: inherit;">rely on ERM, i.e., Emprical Risk Minimisation.</span><span style="font-family: inherit;"> </span><span style="font-family: inherit;">However, the purpose of</span><span style="font-family: inherit;"> </span><span style="font-family: inherit;">machine learning algorithm is not to minimise risk, as we only have</span><span style="font-family: inherit;"> </span><span style="font-family: inherit;">a</span><span style="font-family: inherit;"> </span><span style="font-family: inherit;">partial knowledge about the reality through data.</span><span style="font-family: inherit;"> </span><span style="font-family: inherit;">Learning implies</span><span style="font-family: inherit;"> </span><span style="font-family: inherit;">finding out a region</span><span style="font-family: inherit;"> </span><span style="font-family: inherit;">in accessible learning space whereby there is a</span><span style="font-family: inherit;"> </span><span style="font-family: inherit;">monotonic increase in the objective; ERM is only a single point on this space,</span><span style="font-family: inherit;"> </span><span style="font-family: inherit;">the concept rooted in German scientist Hermann Ebbinghaus</span><span style="font-family: inherit;"> </span><span style="font-family: inherit;">work on memory.</span></p>
<p style="font-stretch: normal; line-height: normal; margin: 0px; min-height: 15px;"><span style="font-family: inherit;"><br /></span></p>
<p style="font-stretch: normal; line-height: normal; margin: 0px;"><span style="font-family: inherit;">There is an intimate connection to thermodynamic limit and true risk in this direction </span><span style="font-family: inherit;">as an open research.</span><span style="font-family: inherit;"> </span><span style="font-family: inherit;">However, it doesn’t imply infinite limit of data, but the observable’s </span><span style="font-family: inherit;">behaviour. That’s why full empiricist</span><span style="font-family: inherit;"> </span><span style="font-family: inherit;">approaches usually requires a complement of</span><span style="font-family: inherit;"> a</span><span style="font-family: inherit;"> physical laws,</span><span style="font-family: inherit;"> </span><span style="font-family: inherit;">such as Physics Informed Neural Networks (PINNs) or</span><span style="font-family: inherit;"> </span><span style="font-family: inherit;">Structural Causal Model (SCM).</span></p><p style="text-align: left;"><b>Postscript 3: <span color="rgba(0, 0, 0, 0.9)" face="-apple-system, system-ui, BlinkMacSystemFont, "Segoe UI", Roboto, "Helvetica Neue", "Fira Sans", Ubuntu, Oxygen, "Oxygen Sans", Cantarell, "Droid Sans", "Apple Color Emoji", "Segoe UI Emoji", "Segoe UI Emoji", "Segoe UI Symbol", "Lucida Grande", Helvetica, Arial, sans-serif" style="background-color: white; caret-color: rgba(0, 0, 0, 0.9); font-size: 14px;">Missing abstraction in modern machine learning libraries</span><span color="rgba(0, 0, 0, 0.9)" face="-apple-system, system-ui, BlinkMacSystemFont, "Segoe UI", Roboto, "Helvetica Neue", "Fira Sans", Ubuntu, Oxygen, "Oxygen Sans", Cantarell, "Droid Sans", "Apple Color Emoji", "Segoe UI Emoji", "Segoe UI Emoji", "Segoe UI Symbol", "Lucida Grande", Helvetica, Arial, sans-serif" style="background-color: white; caret-color: rgba(0, 0, 0, 0.9); font-size: 14px;"> </span></b></p><div style="text-align: left;"><span style="font-family: inherit;"><span color="rgba(0, 0, 0, 0.9)" style="background-color: white; caret-color: rgba(0, 0, 0, 0.9);">Interestingly current modern machine learning libraries stop </span><span color="rgba(0, 0, 0, 0.9)" style="background-color: white; caret-color: rgba(0, 0, 0, 0.9);">abstracting further than fitting: .fit and .predict. This is short </span><span color="rgba(0, 0, 0, 0.9)" style="background-color: white; caret-color: rgba(0, 0, 0, 0.9);">of learning as in machine learning. Learning manifest itself </span><span color="rgba(0, 0, 0, 0.9)" style="background-color: white; caret-color: rgba(0, 0, 0, 0.9);">In learning curves. .learn functionality can be leveraged beyond </span><span color="rgba(0, 0, 0, 0.9)" style="background-color: white; caret-color: rgba(0, 0, 0, 0.9);">fitting and if we are learning via monotonically increasing </span><span color="rgba(0, 0, 0, 0.9)" style="background-color: white; caret-color: rgba(0, 0, 0, 0.9);">performance. 
The origin of this lack of tools for .learn appears to be how Empirical Risk Minimisation (ERM) is formulated, on a single task.</span></div>msuzenhttp://www.blogger.com/profile/06434797231632063088noreply@blogger.comtag:blogger.com,1999:blog-4550553973032503669.post-35734665231825029132022-05-11T11:19:00.002-07:002024-02-03T03:15:56.757-08:00A misconception in ergodicity: Identify ergodic regime not ergodic process<p><b>Preamble</b> </p><p></p><table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="float: right; margin-left: 1em; text-align: right;"><tbody><tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjfpRBBYwJ0Eg-28WKbLMqj9e5sYSqcdROMlqGhZiNnlSU_S114gsJWDJUmlJCsnF4Ztdrrb3vi7A8-UK9wxnbabAopy9CW6_DgZdDvj-AANOrGuO-hcDVd-jFEqTfQ3hA71_IcX6rqB5qxIl2cWr-DNi1qvJ8SOaoflhD8zxCrJci_1Z5ptKPYJWAucA/s413/ergodic_regime_approach.png" style="margin-left: auto; margin-right: auto;"><img border="0" data-original-height="301" data-original-width="413" height="233" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjfpRBBYwJ0Eg-28WKbLMqj9e5sYSqcdROMlqGhZiNnlSU_S114gsJWDJUmlJCsnF4Ztdrrb3vi7A8-UK9wxnbabAopy9CW6_DgZdDvj-AANOrGuO-hcDVd-jFEqTfQ3hA71_IcX6rqB5qxIl2cWr-DNi1qvJ8SOaoflhD8zxCrJci_1Z5ptKPYJWAucA/s320/ergodic_regime_approach.png" width="320" /></a></td></tr><tr><td class="tr-caption" style="text-align: center;"><span> Figure 1: Two observables' approach to<br /> ergodicity for Bernoulli trials. </span></td></tr></tbody></table><a href="http://science-memo.blogspot.com/2020/01/a-practical-understanding-of-ergodicity.html">Ergodicity</a> appears in many fields, from physics, chemistry and the natural sciences to economics and machine learning. Recall that the physical and the mathematical definitions of ergodicity diverge significantly, due to <a href="http://science-memo.blogspot.com/2014/05/is-ergodicity-reasonable-hypothesis.html">Birkhoff's statistical definition against Boltzmann's physical approach</a>. Here we will follow Birkhoff's definition of ergodicity, which is a statistical one. The basic notion of ergodicity is confusing even in experienced academic circles. The primary misconception is that ergodicity is attributed to a process, i.e., a given process being ergodic. We address this by pointing out that ergodicity appears as a regime, a window so to speak, in a given process's time-evolution, and it can't be attributed to the entire generating process. <p></p><p><b>No such thing as ergodic process but ergodic regime given observable</b></p><p>Identifying a process as ergodic is not an entirely correct identification. Ergodicity is a regime over a given time window for a given observable derived from the process. This is the basis of ensemble theory in statistical physics. Most processes initially generate a non-ergodic regime for a given observable. 
In order to identify an ergodic regime, we need to define, in a discrete setting: </p><p></p><ol style="text-align: left;"><li>the ensemble (sample space): in discrete dynamics we also have an alphabet that the ensemble is composed of,</li><li>an observable defined over the sample space,</li><li>a process (usually dynamics on the sample space evolving over time),</li><li>a measure and a threshold to discriminate ergodic from non-ergodic regimes. </li></ol>Interestingly, different observables on the same ensemble and process may generate different ergodic regimes. <p></p><p><b> What are the processes and regimes mathematically?</b></p><p>A process is essentially <a href="https://en.wikipedia.org/wiki/Dynamical_system">a dynamical system mathematically</a>. This includes <a href="https://en.wikipedia.org/wiki/Stochastic">stochastic</a> models as well as deterministic systems sensitive to initial conditions; prominently, both are combined in statistical physics. A regime mathematically implies a range of parameters, or a time-window, in which a system behaves very differently. </p><p><b> Identification of ergodic regime</b></p><p></p><table cellpadding="0" cellspacing="0" class="tr-caption-container" style="float: left; margin-right: 1em; text-align: left;"><tbody><tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhl24erp7eerJLdT-sE5OhSMEViT2QO3R5ba80v-6cV1XLI75oxIHnKNO83HgcL_s5v_PXcLu5BHQsHsH2CJIkMhLmw9RSwjnp3IwAKEuFhIjCTlJ3-FzhR5GzfdcqUVLugPBOodIkUCz7PhmGFJlbKOc2RuXNkD31QXm_nWMBfQh-jnHnmEEk_TrPBlg/s392/ergodic_regime_or_on_site.png" style="margin-left: auto; margin-right: auto;"><img border="0" data-original-height="297" data-original-width="392" height="242" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhl24erp7eerJLdT-sE5OhSMEViT2QO3R5ba80v-6cV1XLI75oxIHnKNO83HgcL_s5v_PXcLu5BHQsHsH2CJIkMhLmw9RSwjnp3IwAKEuFhIjCTlJ3-FzhR5GzfdcqUVLugPBOodIkUCz7PhmGFJlbKOc2RuXNkD31QXm_nWMBfQh-jnHnmEEk_TrPBlg/s320/ergodic_regime_or_on_site.png" width="320" /></a></td></tr><tr><td class="tr-caption" style="text-align: center;">Figure 2: Evolution of <br />time-averaged OR observable.</td></tr></tbody></table>The main objective of finding out whether the dynamics produced by the process enters, or is in, an ergodic regime for our observable is to measure whether the ensemble-averaged observable is <i><u>equivalent</u></i> to the time-averaged observable. Here, equivalence is a difficult concept to address quantitatively. The simplest measure would be to check whether $\Omega = \langle A \rangle_{ensemble} - \langle A \rangle_{time}$ is close to zero, i.e., vanishing, $\Omega$ being the ergodicity measure and $A$ the observable under the two different averaging procedures. This is the definition we will use here. However, beware that in the physics literature there are more advanced measures to detect ergodicity, such as considering <a href="https://journals.aps.org/pre/abstract/10.1103/PhysRevE.90.032141">diffusion-like behaviour</a>, meaning that the transition from the non-ergodic to the ergodic regime is not abrupt but has a diffusing approach to ergodicity. 
<p></p><p></p><table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="float: right; margin-left: 1em; text-align: right;"><tbody><tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhv_xOh-XlZY2PNrCJoGrxYTLZL7K4L0ygByX2N-7wtJ3gFNwzEZtpAvK3rwugynuy8utSnnFz0TTtsSWNCQ0OCJfGWRxW4_v_jojH1WptJSuafaMDb2CXXNWDcyKX49YcmUr0ycxrmgLUujuF27NRAPrOqDsJKoiyhM3KvZ864WkzXCkIC3uZiipMpbA/s392/ergodic_regime_average_on_site.png" style="margin-left: auto; margin-right: auto;"><img border="0" data-original-height="298" data-original-width="392" height="243" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhv_xOh-XlZY2PNrCJoGrxYTLZL7K4L0ygByX2N-7wtJ3gFNwzEZtpAvK3rwugynuy8utSnnFz0TTtsSWNCQ0OCJfGWRxW4_v_jojH1WptJSuafaMDb2CXXNWDcyKX49YcmUr0ycxrmgLUujuF27NRAPrOqDsJKoiyhM3KvZ864WkzXCkIC3uZiipMpbA/s320/ergodic_regime_average_on_site.png" width="320" /></a></td></tr><tr><td class="tr-caption" style="text-align: center;">Figure 3: Evolution of <br />time-averaged mean.<br /><br /></td></tr></tbody></table>In some other academic fields the <i>approach to the ergodic regime</i> has different, not strictly but closely related, names: in chemical physics or molecular dynamics it is called <i>equilibration time</i>, <i>relaxation time, equilibrium, or steady-state</i> for a given observable; in statistical Monte Carlo simulations it is usually called the <i>burn-in</i> period. <u>Not always</u>, but in the ergodic regime the observable is typically stationary and time-independent. In physics this is much easier to distinguish, because time-dependence, equilibrium and stationarity are tied to energy transfer to the system. <p></p><p><b>Ergodic regime not ergodic process : An example of Bernoulli Trials</b></p><p>Apart from real physical processes such as the Ising Model, a basic process we can use to understand how an ergodic regime could be detected is Bernoulli trials. </p><p>Here, for Bernoulli trials/processes, we will use random number generators for a binary outcome, i.e., the Mersenne-Twister RNG, to generate the time evolution of observables on two sites. Let's say we have two sites $x, y \in \{1, 0\}$. The ensemble of this two-site system $xy$ is simply the sample space of all possible outcomes $S=\{10, 11, 01, 00\}$. The time evolution of such a two-site system is formulated here as choosing from $\{0,1\}$ for a given site at a given time, see the Appendix Python notebook. </p><p>Now, the most important part of checking the ergodic regime is that we need to define observables over the two-site trials. We denote the two observables as $O_{1}$, the average over the two sites, and $O_{2}$, an OR operation between the sites. Since our sample space is small, we can compute the ensemble-averaged observables analytically:</p><p></p><ul style="text-align: left;"><li>$O_{1} = (x+y)/2$, then over $10, 11, 01, 00$: $(1/2 + 2/2 + 1/2 + 0 ) /4 = 0.5$</li><li>$O_{2} = x \text{ OR } y$, then over $10, 11, 01, 00$: $( 1 + 1 + 1 + 0 )/4 = 0.75$ </li></ul><p></p><p>We can compute the time-averaged observables via simulation, and their formulations are known as follows: </p><p></p><ul style="text-align: left;"><li> Time average for $O_{1}$ at time $t$ (current step) is $ \frac{1}{t} \sum_{i=0}^{t} (x_{i}+y_{i})/2$,</li><li>Time average for $O_{2}$ at time $t$ (current step) is $ \frac{1}{t} \sum_{i=0}^{t} (x_{i} \text{ OR } y_{i})$.</li></ul><p></p><p>One of the possible trajectories is shown in Figures 2 and 3. The approach-to-ergodicity measure is shown in Figure 1.</p>
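<p>A self-contained sketch of this two-site simulation and the ergodicity measure $\Omega$ (variable names are ours; the full version is in the Appendix notebook):</p><pre style="overflow-wrap: break-word; white-space: pre-wrap; word-wrap: break-word;">import numpy as np

rng = np.random.default_rng(42)
T = 20000                          # number of time steps
x = rng.integers(0, 2, size=T)     # trajectory of site x
y = rng.integers(0, 2, size=T)     # trajectory of site y

steps = np.arange(1, T + 1)
o1_time = np.cumsum((x + y) / 2.0) / steps  # running time average of O1
o2_time = np.cumsum(x | y) / steps          # running time average of O2 (OR)

# Analytic ensemble averages over S = {10, 11, 01, 00}, as computed above.
o1_ens, o2_ens = 0.5, 0.75

# Ergodicity measure per observable: Omega = &lt;A&gt;_ensemble - &lt;A&gt;_time.
omega1 = o1_ens - o1_time[-1]
omega2 = o2_ens - o2_time[-1]
print(abs(omega1), abs(omega2))  # both should be close to zero (ergodic regime)</pre>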
<p>Even though we should run multiple trajectories to have error estimates, we can clearly see that the ergodic regime starts after at least 10K steps. Moreover, different observables have different decay rates towards the ergodic regime. From this preliminary simulation, the OR observable appears to converge more slowly, though this is a single trajectory.</p><p><b>Conclusion</b></p><p>We have shown that the manifestation of the <i>ergodic regime</i> depends on the time-evolution of the observable, given a measure of ergodicity, i.e., a condition for how ergodicity is detected. This exposition should clarify that a generating process does not get the attribute of "ergodic process"; rather, we talk about an "ergodic regime" depending on the observable and the process over temporal evolution. Interestingly, from the physics point of view, it is perfectly possible that an observable attains an ergodic regime and then falls back into a non-ergodic regime.</p><p><b>Further reading</b></p><p></p><ul style="text-align: left;"><li><a href="http://science-memo.blogspot.com/2020/01/a-practical-understanding-of-ergodicity.html">Practical Understanding of Ergodicity</a> : Elementary ergodicity and some basic references.</li><li><a href="http://science-memo.blogspot.com/2014/05/is-ergodicity-reasonable-hypothesis.html">Is ergodicity a reasonable hypothesis? </a> : Boltzmann's definition of ergodicity.</li><li><a href="https://arxiv.org/abs/0904.3122">Scaling of ergodicity in binary systems</a> : An idea on extending Bernoulli-trial ergodicity to N dimensions (sites).</li><li><a href="https://journals.aps.org/pre/abstract/10.1103/PhysRevE.90.032141">Effective ergodicity in single-spin-flip dynamics</a> PRE : Approach to ergodicity in magnetic systems; extends to neural networks. </li><li><a href="https://cran.r-project.org/web/packages/isingLenzMC/vignettes/isingLenzMC.pdf">IsingLenzMC R package</a> : Effective ergodicity convergence R utilities and Ising-Lenz 1-D Monte Carlo.</li><li><a href="https://arxiv.org/abs/1606.08693">Diffusive behaviour of ergodicity convergence in the Ising Model.</a></li></ul><p></p><p><b>Appendix: Code</b></p><p>The Bernoulli trial example we discussed is available as a Python notebook on GitHub <a href="https://github.com/msuzen/scientificMemo/blob/master/ergodicRegime/regime_ergodic.ipynb">here</a>. 
</p><div style="caret-color: rgb(51, 51, 51); font-family: Arial, Tahoma, Helvetica, FreeSans, sans-serif; font-size: 13px;"><span style="font-family: inherit;">Please cite as follows:</span></div><div style="caret-color: rgb(51, 51, 51); font-family: Arial, Tahoma, Helvetica, FreeSans, sans-serif; font-size: 13px;"><span style="font-family: inherit;"><u><br /></u></span></div><div style="caret-color: rgb(51, 51, 51); font-family: Arial, Tahoma, Helvetica, FreeSans, sans-serif; font-size: 13px;"><span style="font-family: inherit;"> @misc{suezen22ergoreg, </span></div><div style="caret-color: rgb(51, 51, 51); font-family: Arial, Tahoma, Helvetica, FreeSans, sans-serif; font-size: 13px;"><span style="font-family: inherit;"> title = {</span>A misconception in ergodicity: Identify ergodic regime not ergodic process<span style="font-family: inherit;">}, </span></div><div style="caret-color: rgb(51, 51, 51); font-family: Arial, Tahoma, Helvetica, FreeSans, sans-serif; font-size: 13px;"><span style="font-family: inherit;"> howpublished = {\url{</span>http://science-memo.blogspot.com/2022/05/ergodic-regime-not-process.html<span style="font-family: inherit;">}, </span></div><div style="caret-color: rgb(51, 51, 51); font-family: Arial, Tahoma, Helvetica, FreeSans, sans-serif; font-size: 13px;"><span style="font-family: inherit;"> author = {Mehmet Süzen},</span></div><div style="caret-color: rgb(51, 51, 51); font-family: Arial, Tahoma, Helvetica, FreeSans, sans-serif; font-size: 13px;"><span style="font-family: inherit;"> year = {2022}</span></div><div style="caret-color: rgb(51, 51, 51); font-family: Arial, Tahoma, Helvetica, FreeSans, sans-serif; font-size: 13px;"><span style="font-family: inherit;">}</span> </div>msuzenhttp://www.blogger.com/profile/06434797231632063088noreply@blogger.com0tag:blogger.com,1999:blog-4550553973032503669.post-45699223323940884952022-02-11T10:48:00.008-08:002023-12-08T13:45:02.549-08:00 Physics origins of the most important statistical ideas of recent times<table cellpadding="0" cellspacing="0" class="tr-caption-container" style="float: left; margin-right: 1em; text-align: left;"><tbody><tr><td style="text-align: center;"><a href="https://upload.wikimedia.org/wikipedia/commons/2/21/Maxwell's_letters_plate_IV.jpg" style="clear: right; margin-bottom: 1em; margin-left: auto; margin-right: auto;"><span style="font-family: inherit;"><img border="0" data-original-height="539" data-original-width="418" height="320" src="https://upload.wikimedia.org/wikipedia/commons/2/21/Maxwell's_letters_plate_IV.jpg" width="248" /></span></a></td></tr><tr><td class="tr-caption" style="text-align: center;"><span style="font-family: inherit;">Figure: Maxwell's handwritings, <br />state diagram (Wikipedia)</span></td></tr></tbody></table><div class="separator"><a href="https://upload.wikimedia.org/wikipedia/commons/2/21/Maxwell's_letters_plate_IV.jpg" style="clear: right; float: right; margin-bottom: 1em; margin-left: 1em;"><span style="font-family: inherit;"><br /></span></a><a href="https://upload.wikimedia.org/wikipedia/commons/2/21/Maxwell's_letters_plate_IV.jpg" style="clear: right; float: right; margin-bottom: 1em; margin-left: 1em;"><span style="font-family: inherit;"><br /></span></a></div><span style="font-family: inherit;"><b>Preamble</b><br /><b><br /></b>The modern statistics now move into an emerging field called <a href="https://www.tandfonline.com/doi/full/10.1080/10618600.2017.1384734">data science</a> that amalgamate many different fields from <a 
href="https://www.usgs.gov/advanced-research-computing/what-high-performance-computing">high performance computing </a>to <a href="https://en.wikipedia.org/wiki/Control_theory">control engineering</a>. However, the emergent behaviour from researchers in machine learning and statistics that, sometimes <i>they omit naïvely</i> and probably <i>unknowingly</i> the fact that some of the most important ideas in data sciences are actually originated from Physics discoveries and specifically developed by physicist. In this short exposition we try to review these physics origins on the areas defined by Gelman and Vehtari (<a href="https://doi.org/10.1080/01621459.2021.1938081">doi</a>). Additional section is also added in other possible areas that are currently the focus of active research in data sciences. <br /><br /><b>Bootstrapping and simulation based inference : Gibbs's Ensemble theory and Metropolis's simulations</b><br /><b><br /></b></span><table cellpadding="0" cellspacing="0" class="tr-caption-container" style="float: right; text-align: left;"><tbody><tr><td class="tr-caption" style="text-align: center;"><blockquote style="border: medium; margin: 0px 0px 0px 40px; padding: 0px;"></blockquote><span style="font-family: inherit;"><br /></span></td></tr></tbody></table><div style="text-align: left;"><span style="font-family: inherit;">Bootstrapping is a novel idea of estimations with uncertainty with given set of samples. It is mostly popularised by <a href="https://projecteuclid.org/journals/annals-of-statistics/volume-7/issue-1/Bootstrap-Methods-Another-Look-at-the-Jackknife/10.1214/aos/1176344552.full">Efron</a> and his contribution is immense, making this tool available to all researchers doing quantitative analysis. However, the origins of bootstrapping can be traced back to the idea of <a href="https://en.wikipedia.org/wiki/Ensemble_(mathematical_physics)">ensembles</a> in statistical physics, which is introduced by <a href="https://en.wikipedia.org/wiki/Josiah_Willard_Gibbs">J. Gibbs</a>. The ensembles in physics allow us to do just what bootstrapping helps, estimating a quantity of interest with sub-sampling, in the case of statistical physics this appears as sampling a set of different microstates. Using this idea Metropolis devised a inference in <a href="https://en.wikipedia.org/wiki/Equation_of_State_Calculations_by_Fast_Computing_Machines">1953</a>, to compute ensemble averages for liquids using computers. 
<div style="text-align: left;"><span style="font-family: inherit;">Note that the usage of the Monte Carlo approach for purely mathematical purposes, i.e., solving integrals, appears much earlier, with von Neumann's efforts.</span></div><div style="text-align: left;"><span style="font-family: inherit;"><br /></span></div><div style="text-align: left;"><b><span style="font-family: inherit;">Causality : Hamiltonian systems to Thermodynamic potentials</span></b></div><div style="text-align: left;"><span style="font-family: inherit;"><br /></span></div><div style="text-align: left;"><table cellpadding="0" cellspacing="0" class="tr-caption-container" style="float: right; margin-left: 1em; text-align: right;"><tbody><tr><td style="text-align: center;"><a href="https://upload.wikimedia.org/wikipedia/commons/thumb/f/fb/Thermodynamic_square.svg/480px-Thermodynamic_square.svg.png" style="clear: left; margin-bottom: 1em; margin-left: auto; margin-right: auto;"><span style="font-family: inherit;"><img border="0" data-original-height="480" data-original-width="480" height="200" src="https://upload.wikimedia.org/wikipedia/commons/thumb/f/fb/Thermodynamic_square.svg/480px-Thermodynamic_square.svg.png" width="200" /></span></a></td></tr><tr><td class="tr-caption" style="text-align: center;"><span style="font-family: inherit;">Figure: Maxwell <br />Relations as causal <br />diagrams.</span></td></tr></tbody></table><span style="font-family: inherit;">Even though the historical roots of causal analysis in the early 20th century are attributed to <a href="https://academic.oup.com/genetics/article/8/3/239/6046336">Wright 1923 </a>for his definition of path analysis, causality was among the core tenets of Newtonian mechanics, distinguishing the left and right sides of the equations of motion in the form of differential equations; with Hamiltonian mechanics, the set of differential equations actually forms a graph, i.e., relationships between generalised coordinates, momenta and positions. This connection was never acknowledged in the early statistical literature; probably the causal constructions from classical physics were not well known in that community, or did not find their way into data-driven mechanics. Similarly, the causal construction of <a href="https://en.wikipedia.org/wiki/Thermodynamic_potential">thermodynamic potentials </a>appears as a directed graph, as in the Born wheel. It appears as a mnemonic, but it is actually causally constructed via<a href="https://en.wikipedia.org/wiki/Legendre_transformation"> Legendre Transformations</a>. Of course causality, philosophically speaking, has been discussed since Ancient Greece, but here we restrict the discussion to solely quantitative theories after Newton.</span></div><div style="text-align: left;"><span style="font-family: inherit;"><br /></span></div><div style="text-align: left;"><b><span style="font-family: inherit;">Overparametrised models and regularisation : Poincaré classifications and astrophysical dynamics</span></b></div><div style="text-align: left;"><span style="font-family: inherit;"><br /></span></div><div style="text-align: left;"><span style="font-family: inherit;">Current deep learning systems are classified as massively overparametrised systems. However, the lower-dimensional understanding of this phenomenon was well studied in Poincaré's classification of classical dynamics, namely the measurement problem of having an overdetermined system of differential equations; such inverse problems are well known in astrophysics and theoretical mechanics. 
</span></div><div style="text-align: left;"><span style="font-family: inherit;"><br /></span></div><div style="text-align: left;"><b><span style="font-family: inherit;">High-performance computing: Big-data to GPUs</span></b></div><div style="text-align: left;"><span style="font-family: inherit;"><br /></span></div><div style="text-align: left;"><span style="font-family: inherit;">Similarly, using supercomputers, or high-performance computing as we now call it, with big-data-generating processes can actually be traced back to the Manhattan Project and ENIAC, which aimed at solving scattering equations, with almost 50 years of development in this direction before the 2000s. </span></div><div style="text-align: left;"><span style="font-family: inherit;"><br /></span></div><div style="text-align: left;"><b><span style="font-family: inherit;">Conclusion</span></b></div><div style="text-align: left;"><span style="font-family: inherit;"><br /></span></div><div style="text-align: left;"><span style="font-family: inherit;">The impressive development of the emergent field of data science, as a larger perspective of statistics into computer science, has strong origins in the core physics literature and research. These connections are not sufficiently cited or acknowledged. Our aim in this short exposition is to bring these aspects to the attention of data science practitioners and researchers alike.</span></div><div style="text-align: left;"><span style="font-family: inherit;"><br /></span></div><div style="text-align: left;"><b><span style="font-family: inherit;">Further reading</span></b></div><div style="text-align: left;"><span style="font-family: inherit;">Some of the mentioned works and a related reading list, papers or books.</span></div><div style="text-align: left;"><div><span style="font-family: inherit;"><br /></span></div><ul style="text-align: left;"><li><span style="font-family: inherit;"><a href="https://www.tandfonline.com/doi/full/10.1080/01621459.2021.1938081">What are the Most Important Statistical Ideas of the Past 50 Years? Gelman & Vehtari (2021)</a></span></li><li><span style="font-family: inherit;"><a href="https://www.jstor.org/stable/2685844">A Leisurely Look at the Bootstrap, the Jackknife, and Cross-Validation, Bradley Efron and Gail Gong (1983)</a></span></li><li><span style="font-family: inherit;"><a href="https://en.wikipedia.org/wiki/Elementary_Principles_in_Statistical_Mechanics">Elementary Principles in Statistical Mechanics, Gibbs (1902)</a></span></li><li><span style="font-family: inherit;"><a href="https://en.wikipedia.org/wiki/Equation_of_State_Calculations_by_Fast_Computing_Machines">Equation of State Calculations by Fast Computing Machines, Metropolis et al. 
(1953)</a></span></li><li><span style="font-family: inherit;"><a href="https://iopscience.iop.org/article/10.1088/0305-4470/24/2/004/meta">Generalized statistical mechanics: connection with thermodynamics, Curado-Tsallis (1992)</a></span></li><li><span style="font-family: inherit;"><a href="https://www.sciencedirect.com/science/article/abs/pii/001046559600032X">Poincaré sections of Hamiltonian systems (1996)</a></span></li><li><span style="font-family: inherit;"><a href="https://journals.aps.org/pre/abstract/10.1103/PhysRevE.55.811">Statistical mechanics of ensemble learning, Anders Krogh and Peter Sollich (1997)</a></span></li></ul></div><div style="text-align: left;"><span style="font-family: inherit;"><br /></span></div><div style="text-align: left;"><span style="font-family: inherit;">Please cite as follows:</span></div><div style="text-align: left;"><span style="font-family: inherit;"><br /></span></div><div style="text-align: left;"><div><span style="font-family: inherit;"> @misc{suezen22pom, </span></div><div><span style="font-family: inherit;"> title = { Physics origins of the most important statistical ideas of recent times }, </span></div><div><span style="font-family: inherit;"> howpublished = {\url{http://science-memo.blogspot.com/2022/02/physics-origins-of-most-important.html}}, </span></div><div><span style="font-family: inherit;"> author = {Mehmet Süzen},</span></div><div><span style="font-family: inherit;"> year = {2022}</span></div><div><span style="font-family: inherit;"> }</span></div></div><div style="text-align: left;"><b><span style="font-family: inherit;">Appendix: Pearson correlation and Lattices</span></b></div><div style="text-align: left;"><b><span style="font-family: inherit;"><br /></span></b></div><div style="text-align: left;"><span style="font-family: inherit;">Auguste Bravais is famous for his foundational work on the mathematical theory of crystallography, which now seems to go far beyond periodic solids. Unknown to many, he actually first derived the expression for what we know today as the correlation coefficient, or Pearson’s correlation, or less commonly the Pearson-Galton coefficient. 
Interestingly, Wright, one of the grandfathers of causal analysis, mentioned this in his seminal work of 1921 titled “Correlation and causation”, acknowledging Bravais's 1849 work as the first derivation of correlation.</span></div><div style="text-align: left;"><span style="font-family: inherit;"><br /></span></div><div style="text-align: left;"><span style="font-family: inherit;"><b>Appendix: Partition function and set theoretic probability</b></span></div><div style="text-align: left;"><span style="font-family: inherit;"><br /></span></div><div style="text-align: left;"><span style="font-family: inherit;">Long before Kolmogorov set forward his formal foundations of probability, Boltzmann, Maxwell and Gibbs built theories of statistical mechanics using probabilistic language, and even defined settings for set-theoretic foundations by introducing ensembles for thermodynamics. For example, the partition function ($Z$) appeared as a normalisation factor so that the summation of densities yields 1. Apparently Kolmogorov and his contemporaries drew much inspiration from the physics and mechanics literature.</span></div><div style="text-align: left;"><span style="font-family: inherit;"><br /></span></div><div style="text-align: left;"><span style="font-family: inherit;"><b>Appendix: Generative AI</b></span></div><div style="text-align: left;"><span style="font-family: inherit;"><br /></span></div><div style="text-align: left;"><span style="font-family: inherit;">Of course generative AI has now taken over the hype. Indeed, the physics of diffusion, from the Fokker-Planck equation to basic Langevin dynamics, is leveraged. </span></div><div style="text-align: left;"><span style="font-family: inherit;"> </span></div><div style="text-align: left;"><span style="font-family: inherit;"><b>Appendix: Physics is fundamental for the advancement of AI research and practice </b></span></div><div style="text-align: left;"><span style="font-family: inherit;">
<p style="font-feature-settings: normal; font-kerning: auto; font-optical-sizing: auto; font-size-adjust: none; font-stretch: normal; font-style: normal; font-variant-alternates: normal; font-variant-caps: normal; font-variant-east-asian: normal; font-variant-ligatures: normal; font-variant-numeric: normal; font-variant-position: normal; font-variation-settings: normal; line-height: normal; margin: 0px; min-height: 15px;"><span style="font-family: inherit;"><br /></span></p>
<p style="font-feature-settings: normal; font-kerning: auto; font-optical-sizing: auto; font-size-adjust: none; font-stretch: normal; font-style: normal; font-variant-alternates: normal; font-variant-caps: normal; font-variant-east-asian: normal; font-variant-ligatures: normal; font-variant-numeric: normal; font-variant-position: normal; font-variation-settings: normal; line-height: normal; margin: 0px;"><span style="font-family: inherit;">AI as a phenomena appears to be in the domain of core physics. For this reason, studying physics as a (post)-degree or as a self-study modules will give students and practitioners alike a definitive cutting-edge insights. </span></p>
<ul>
<li style="line-height: normal; margin: 0px;"><span style="font-family: inherit;">Statistical models based on correlations originate from the physics of periodic solids and astrophysical n-body dynamics.</span></li>
<li style="line-height: normal; margin: 0px;"><span style="font-family: inherit;">Neural networks originate from the modelling of magnetic materials in discrete states, later named a cooperative phenomenon. Their training dynamics closely follows free-energy minimisation.</span></li>
<li style="line-height: normal; margin: 0px;"><span style="font-family: inherit;">Causality has roots in the ensemble theory of physical entropy.</span></li>
<li style="line-height: normal; margin: 0px;"><span style="font-family: inherit;">Almost all sampling-based techniques are based on the idea of sampling the physics of energy surfaces, i.e., Potential Energy Surfaces (PES).</span></li>
<li style="line-height: normal; margin: 0px;"><span style="font-family: inherit;">Generative AI originates from the physics of diffusion in fluids: the classical Liouville description of classical mechanics, i.e., phase-space flows, and generalised Fokker-Planck dynamics. </span></li>
<li style="line-height: normal; margin: 0px;"><span style="font-family: inherit;">Language models based on attention are actually coarse-grained entropy-dynamics as introduced by Gibbs: ‘attention layers’ behave as a coarse-graining procedure, i.e., a compressed causal-graph mapping.</span></li></ul>
<p style="font-feature-settings: normal; font-kerning: auto; font-optical-sizing: auto; font-size-adjust: none; font-stretch: normal; font-style: normal; font-variant-alternates: normal; font-variant-caps: normal; font-variant-east-asian: normal; font-variant-ligatures: normal; font-variant-numeric: normal; font-variant-position: normal; font-variation-settings: normal; line-height: normal; margin: 0px;"><span style="font-family: inherit;">This is not about building analogies to physics but as foundational topics to AI.</span></p><div><br /></div></span></div>msuzenhttp://www.blogger.com/profile/06434797231632063088noreply@blogger.com0tag:blogger.com,1999:blog-4550553973032503669.post-7824056538205276392021-11-15T11:46:00.006-08:002023-02-08T01:02:16.037-08:00Periodic Spectral Ergodicity Accurately Predicts Deep Learning Generalisation<p> <b>Preamble</b> </p><p><table cellpadding="0" cellspacing="0" class="tr-caption-container" style="float: right;"><tbody><tr><td style="text-align: center;"><a href="https://upload.wikimedia.org/wikipedia/en/d/dd/The_Persistence_of_Memory.jpg" imageanchor="1" style="clear: right; margin-bottom: 1em; margin-left: auto; margin-right: auto;"><img border="0" data-original-height="271" data-original-width="368" height="236" src="https://upload.wikimedia.org/wikipedia/en/d/dd/The_Persistence_of_Memory.jpg" width="320" /></a></td></tr><tr><td class="tr-caption" style="text-align: center;"><span> Dali (1931), <br />The Persistence of Memory (Wikipedia)</span></td></tr></tbody></table><br />One of the<a href="https://science-memo.blogspot.com/2021/07/random-matrix-theory-deep-learning.html"> new mathematical concepts arise due to understanding of deep learning</a> is called periodic spectral ergodicity (PSE). The cascading PSE (cPSE) propagates over deep learning layers which can also be used as a complexity measure. cPSE actually can also predict the generalisation ability. In this post, we review this interesting finding in an easy and short manner.</p><p><b>How periodic spectral ergodicity cascades over layers</b></p><p></p>We have reviewed spectral ergodicity in a gentle fashion earlier, <a href="http://science-memo.blogspot.com/2021/07/spectral-ergodicity-deep-learning.html">here</a>. Only difference is that in real deep learning architectures, length of the eigenvalue spectrum, i.e., the number of bins in the histogram, generated by weight matrices are not equal in size. To align them, we use something called periodic boundary conditions or turn the eigenvalues in a cyclic fashion, up to the maximum length spectra we have seen up to that layer. Here are the steps that give, the intuition of how to compute cascading periodic spectral ergodicity (cPSE).<p></p><p>1. We compute eigenvalue spectrum up to a layer $i$ and align the smaller spectrum with periodic boundary conditions, i.e., cyclic.</p><p>2. Compute spectral ergodicity at layers $i$ and $i-1$.</p><p>3. Compute the cascading PSE at layer $i$ simply with a distance metric $\Omega^{i}$ and $\Omega^{i-1}$. i.e., KL divergence in two directions, recall earlier tutorials. </p><p>If we repeat this up to the last layer, cPSE measures the complexity of the deep learning architecture, both capturing structural and learning algorithm-wise, in a depth of a layer fashion. </p><p><b> Generalisation Gap and cPSE</b></p><p>Apart from being a complexity measure, cPSE predicts the generalisation gap given reference architecture i.e., it correlates with the performance almost perfectly. 
<p><b> Generalisation Gap and cPSE</b></p><p>Apart from being a complexity measure, cPSE predicts the generalisation gap given a reference architecture, i.e., it correlates with performance almost perfectly. These findings are presented in the paper <a href="https://arxiv.org/abs/1911.07831">suzen2019</a>.</p><p><b>Conclusions and Outlook</b></p><p>The complexity of deep learning architectures is still an open research problem. One of the most promising directions is to use cPSE, in terms of capturing structural complexity as well. Other measures in the literature do not consider depth dependency, whereby cPSE appears to be the first one that does.</p><p><b>Reference</b></p><pre style="overflow-wrap: break-word; white-space: pre-wrap; word-wrap: break-word;">@article{<a href="https://arxiv.org/abs/1911.07831">suzen2019</a>,
title={Periodic Spectral Ergodicity: A Complexity Measure for Deep Neural Networks and Neural Architecture Search},
author={S{\"u}zen, Mehmet and Cerd{\`a}, Joan J and Weber, Cornelius},
journal={arXiv preprint arXiv:1911.07831},
year={2019}
}</pre><p>Cite this post as <span style="font-family: courier;">Periodic Spectral Ergodicity Accurately Predicts Deep Learning Generalisation, Mehmet Süzen, https://science-memo.blogspot.com/2021/11/periodic-spectral-ergodicity-predicts-generalisation-deep-learning.html 2021</span></p><p><b>Appendix</b> </p><p>Bristol v0.12.2 now supports computing cPSE from a list of matrices:</p><pre style="overflow-wrap: break-word; white-space: pre-wrap; word-wrap: break-word;">from bristol import cPSE
import numpy as np

np.random.seed(42)
# Layer weight matrices as a list of numpy arrays (random surrogates here).
matrices = [np.random.normal(size=(64, 64)) for _ in range(10)]
# d_layers and cpse as returned by bristol's cpse_measure_vanilla.
(d_layers, cpse) = cPSE.cpse_measure_vanilla(matrices)</pre><p><br /></p>msuzenhttp://www.blogger.com/profile/06434797231632063088noreply@blogger.com0tag:blogger.com,1999:blog-4550553973032503669.post-32613345064514395432021-07-28T08:46:00.007-07:002022-02-21T09:33:27.928-08:00 Deep Learning in Mind a Gentle Introduction to Spectral Ergodicity<div style="text-align: left;"><b>Preamble</b></div><div style="text-align: left;"><br /><table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="float: right; margin-left: 1em; text-align: right;"><tbody><tr><td style="text-align: center;"><a href="https://upload.wikimedia.org/wikipedia/commons/3/3c/Mona_Lisa_eigenvector_grid.png" style="margin-left: auto; margin-right: auto;"><img border="0" data-original-height="558" data-original-width="800" height="280" src="https://upload.wikimedia.org/wikipedia/commons/3/3c/Mona_Lisa_eigenvector_grid.png" width="400" /></a></td></tr><tr><td class="tr-caption" style="text-align: center;"><span> Figure: Monalisa on <br />Eigenvector grids (Wikipedia)</span></td></tr></tbody></table><br />In the post <a href="https://science-memo.blogspot.com/2021/07/random-matrix-theory-deep-learning.html">A New Matrix Mathematics for Deep Learning : Random Matrix Theory of Deep Learning</a>, we outlined new mathematical concepts that are aimed at deep learning but in general belong to applied mathematics. Here we dive into one of these concepts, <i>spectral ergodicity</i>. We aim to convey what it means and how to compute spectral ergodicity for a set of matrices, i.e., an ensemble. We will use a visual aid and verbal descriptions of the steps to produce a quantitative measure of spectral ergodicity. 
</div><div style="text-align: left;"><br /></div><div style="text-align: left;">The idea of spectral ergodicity comes from quantum statistical physics but it is <a href="https://arxiv.org/abs/1704.08303">recently revived for deep learning</a> as a new concept in order to accommodate mathematical needs of explaining and understanding the complexity of deep learning architectures.</div><div style="text-align: left;"><br /><b>Understanding Spectral Ergodicity</b></div><div style="text-align: left;"><b><br /></b></div><div style="text-align: left;">The concept of ergodicity can get quiet mathematical even for a professional mathematician. <a href="http://science-memo.blogspot.com/2020/01/a-practical-understanding-of-ergodicity.html">A practical understanding of ergodicity</a> could lead to the law of large numbers statistically speaking. However, observed ergodicity for ensemble of matrices, i.e. over their eigenvalue spectrum, are not formally defined before in the literature, and only appeared in statistical quantum mechanics in a specialised case. Here we do a formal definition gently.</div><div style="text-align: left;"><br /></div><div style="text-align: left;">The spectral ergodicity of snapshot of values from $M$ matrices, where they are $N \times N$ sizes, denoted by $\Omega$, can be produce with the following steps:</div><div style="text-align: left;"><ol style="text-align: left;"><li>Compute eigenvalues of $M$ matrices separately. </li><li>Produce equidistance spectra of matrices out of eigenvalues, i.e., histograms with $b_{k}$ bins. Each cell in the Figure corresponds to bin in the spectra of the matrices. </li><li>Compute average values over each bin across $M$ matrices.</li><li>Computing root mean square deviation that went to each bin from $M$ matrices from corresponding ensemble averaged value and average over $M$ and $N$. This will give a distribution, $\Omega=\Omega(b_{k})$, which represents spectral ergodicity value, think as a snapshot value of a dynamical process.</li></ol><div>Attentive reader would notice that normally, measures of ergodicity leads to a single value, such as in <a href="https://journals.aps.org/pre/abstract/10.1103/PhysRevE.90.032141">spin-glasses</a>, but here we obtain ergodicity as a measure distribution. This stems from the fact that our observable is not univariate but it is a multivariate measure over spectra of the matrix, i.e., bins in the histogram of eigenvalues. </div><div><br /></div><div><b>Why spectral ergodicity important for deep learning? </b></div><div><br /></div><div>The reason why this measure is so important lies in dynamics and consistency in measuring observables (<b>no</b> nothing to do with quantum mechanics but time and ensemble averages classically). Normally we can't measure ensemble averages. In experimental conditions the measurement we do is usually a time averaged value. This is exactly what happens when we train deep neural network, i.e, ergodicity of weight matrices. Essentially, spectral ergodicity would capture deep neural network's characteristics.</div></div><div class="post-header" style="line-height: 1.6; margin: 0px 0px 1em;"><div class="post-header-line-1"></div></div><div class="post-body entry-content" id="post-body-5528673375881843167" itemprop="description articleBody" style="line-height: 1.4; position: relative; width: 956px;"></div><b>Outlook</b><div><b><br /></b></div><div>The way we express spectral ergodicity here would only consider all layer having the same size. 
<div>One would need a more advanced computation of spectral ergodicity for more realistic architectures, the <a href="https://memosisland.blogspot.com/2019/12/bringing-back-occams-razor-to-modern.html?utm_campaign=UA-41973481-2&utm_medium=email&utm_source=Revue%20newsletter">cascading Periodic Spectral Ergodicity measure</a>, which is suitable as a complexity measure for deep learning. The computation of such a measure is more involved; the spectral ergodicity we cover here is the first step.<div><div><br /></div></div><div>Cite this post with <span style="font-family: courier;">Deep Learning in Mind Very Gentle Introduction to Spectral Ergodicity, Mehmet Süzen, (2021) https://science-memo.blogspot.com/2021/07/deep-learning-random-matrix-theory-spectral-ergodicity.html</span> </div></div>msuzenhttp://www.blogger.com/profile/06434797231632063088noreply@blogger.com0tag:blogger.com,1999:blog-4550553973032503669.post-55286733758818431672021-07-21T09:32:00.004-07:002022-11-26T10:53:39.887-08:00A New Matrix Mathematics for Deep Learning : Random Matrix Theory of Deep Learning <div style="text-align: left;"><b><span style="font-family: inherit;"> Preamble </span></b></div><p style="text-align: justify;"></p><table cellpadding="0" cellspacing="0" class="tr-caption-container" style="float: right; margin-left: 1em; text-align: right;"><tbody><tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgVRelDhIcOHVSh9x-VGkDFmzOzJDzVSBC0jKCpJb7p6QVW-yIpesxfHgLZaL5zUJo7ilCYRPLy3yMb94wX0_6cbYD_gSW-0GHVhQc4wBsBMFOHp7Hokre7iHMN5cRucmdRXnjp60196pKZ/s716/compagner.png" style="clear: right; margin-bottom: 1em; margin-left: auto; margin-right: auto;"><span style="font-family: inherit;"><img border="0" data-original-height="684" data-original-width="716" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgVRelDhIcOHVSh9x-VGkDFmzOzJDzVSBC0jKCpJb7p6QVW-yIpesxfHgLZaL5zUJo7ilCYRPLy3yMb94wX0_6cbYD_gSW-0GHVhQc4wBsBMFOHp7Hokre7iHMN5cRucmdRXnjp60196pKZ/s320/compagner.png" width="320" /></span></a></td></tr><tr><td class="tr-caption" style="text-align: center;"><span style="font-family: inherit;"> Figure: Definition of Randomness<br /> (Compagner 1991, Delft University)</span></td></tr></tbody></table><span style="font-family: inherit;">The <a href="https://awards.acm.org/about/2018-turing">development of deep learning systems</a> (DLs) has increased our hopes of developing more autonomous systems. Based on the hierarchical <a href="https://doi.org/10.1109/TPAMI.2013.50">learning of representations</a>, deep learning defies basic learning theory, to the point that we are <a href="https://cacm.acm.org/magazines/2021/3/250713-understanding-deep-learning-still-requires-rethinking-generalization/fulltext?mobile=false">still rethinking generalisation</a>. Even though DLs in their vanilla form severely lack <a href="https://amturing.acm.org/award_winners/pearl_2658896.cfm">the ability to reason without causal inference</a>, despite this limitation they provide very rich new mathematical concepts, as introduced recently. Here we review a couple of these new concepts briefly and draw attention to <a href="http://www.scholarpedia.org/article/Random_matrix_theory">Random Matrix Theory</a>'s relevance to DLs and its applications in brain networks. 
These concepts in isolation are the subject of applied mathematics, but their interpretation and usage in deep learning architectures were demonstrated only recently. In this post we provide a glossary of the new concepts, which are not only theoretically interesting but also directly practical, from measuring architecture complexity to establishing equivalence. </p><div style="text-align: left;"><b>Random matrices can simulate deep learning architectures with spectral ergodicity</b></div><div style="text-align: left;"><br /></div><div style="text-align: left;">Random Matrix Theory (RMT) has its origins in the foundations of mathematical statistics and mathematical physics, pioneered by the <a href="https://en.wikipedia.org/wiki/Wishart_distribution">Wishart Distribution</a> and <a href="https://en.wikipedia.org/wiki/Circular_ensemble">Dyson Circular Ensembles</a>. The primary ingredients of a deep learning model are its sets of weights, the learned parameters: they manifest as matrices, are the result of a learning dynamics, and are used at so-called inference time. A natural consequence is that these learned matrices can be simulated via random matrices with <a href="https://en.wikipedia.org/wiki/Spectral_radius">spectral radius</a> close to unity (see the sketch after the list below). This gives us the <u><i>ability to make a generic statement about deep learning systems</i></u> independent of </div><div style="text-align: left;"><ol style="text-align: left;"><li>Network architecture (topology).</li><li>Learning algorithm. </li><li>Data sizes and type.</li><li>Training procedure.</li></ol></div>
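<div style="text-align: left;">A minimal sketch of such a surrogate matrix (an illustration under stated assumptions: i.i.d. Gaussian entries scaled so that, by the circular law, the spectral radius concentrates near unity):</div><div style="text-align: left;"><br /></div><pre style="font-family: courier;">
import numpy as np

rng = np.random.default_rng(0)
N = 256
# Ginibre-type surrogate: i.i.d. Gaussian entries scaled by 1/sqrt(N),
# so the spectral radius concentrates near unity for large N.
W = rng.normal(size=(N, N)) / np.sqrt(N)
radius = np.abs(np.linalg.eigvals(W)).max()
print(f"spectral radius ~ {radius:.3f}")  # close to 1.0
</pre>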
<p style="text-align: left;"><b>Why not the Hessian or the loss landscape, but weight matrices? </b></p><p style="text-align: left;">There are studies taking the Hessian matrix as the major object, i.e., the second derivatives of the network's loss with respect to its parameters, and associating it with random matrices. However, this approach only covers properties of the learning algorithm, rather than the architecture's inference or learning capacity. For this reason, weight matrices should be taken as the primary object in any study of random matrix theory in deep learning, as they encode depth. Similarly, the loss landscape cannot capture the capacity of a deep learning architecture. </p><p><b>Conclusion and outlook</b></p><div>In this short exposition, we have tried to stimulate the reader's interest in an exciting set of tools from RMT for deep learning theory and practice. This is still the subject of ongoing research with direct practical relevance. We provide a glossary and a reading list as well. </div><p><b>Further Reading</b></p><p>Papers introducing new mathematical concepts in deep learning are listed here; they come with associated Python code for reproducing the concepts.</p><ul style="text-align: left;"><li><a href="https://arxiv.org/abs/1704.08303">Spectral Ergodicity in Deep Learning Architectures via Surrogate Random Matrices</a></li><li><a href="https://arxiv.org/abs/1911.07831">Periodic Spectral Ergodicity: A Complexity Measure for Deep Neural Networks and Neural Architecture Search</a></li><li><a href="https://arxiv.org/abs/2006.13687">Equivalence in Deep Neural Networks via Conjugate Matrix Ensembles</a></li></ul><div>Earlier relevant blog posts </div><div><ul style="text-align: left;"><li><a href="http://science-memo.blogspot.com/2020/02/freeman-dysons-contribution-to-deep.html">Freeman Dyson's contribution to deep learning: Circular ensembles mimic trained deep neural networks</a></li><li><a href="https://www.kdnuggets.com/2020/01/occams-razor-deep-learning.html">Applying Occam's razor to Deep Learning</a></li><li><a href="http://science-memo.blogspot.com/2020/01/a-practical-understanding-of-ergodicity.html">A practical understanding of ergodicity</a></li><li><a href="http://science-memo.blogspot.com/2020/12/statistical-physics-origins-of.html">Statistical Physics Origins of Connectionist Learning: Cooperative Phenomenon to Ising-Lenz Architectures</a></li></ul></div><p><b>Citing this post</b></p><div style="text-align: left;">A New Matrix Mathematics of Deep Learning: Random Matrix Theory of Deep Learning : https://science-memo.blogspot.com/2021/07/random-matrix-theory-deep-learning.html Mehmet Süzen, 2021</div><p><b>Glossary of New Mathematical Concepts of Deep Learning</b></p><p>A summary of the definitions of the new mathematical concepts for the new matrix mathematics.</p><p><b>Spectral Ergodicity</b> A measure of ergodicity in the spectra of a given random matrix ensemble. Given a set of equal-size matrices coming from the same ensemble, it is the average deviation of the spectral densities of the individual matrices from the ensemble-averaged spectral density. This mimics standard ergodicity: instead of over the states of an observable, it measures ergodicity over eigenvalue densities; $\Omega_{k}^{N}$, for the $k$-th eigenvalue and matrix size $N$.</p><p><b>Spectral Ergodicity Distance</b> A symmetric distance constructed from two Kullback-Leibler distances over two matrix ensembles of different size, taken in the two different directions (a minimal sketch follows below), $D = KL(N_{a} \Vert N_{b}) + KL(N_{b} \Vert N_{a})$.</p>
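<div style="text-align: left;">A minimal Python sketch of this symmetrised distance over two binned spectral densities (illustrative only; the function name and the small regulariser <span style="font-family: courier;">eps</span> are assumptions of the example, not the Bristol implementation):</div><div style="text-align: left;"><br /></div><pre style="font-family: courier;">
import numpy as np

def symmetric_kl(p, q, eps=1e-12):
    """Symmetrised Kullback-Leibler distance D = KL(p||q) + KL(q||p)
    between two binned spectral densities sharing the same binning."""
    p = np.asarray(p, dtype=float) + eps  # guard against empty bins
    q = np.asarray(q, dtype=float) + eps
    p, q = p / p.sum(), q / q.sum()
    return float(np.sum(p * np.log(p / q)) + np.sum(q * np.log(q / p)))
</pre>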
<p><b>Mixed Random Matrix Ensemble (MME)</b> A set of matrices constructed from a random ensemble but with different matrix sizes, from $N$ down to 2, the sizes being determined randomly with a coefficient of mixture. </p><p><b>Periodic Spectral Ergodicity (PSE) </b>A measure of spectral ergodicity for MMEs whereby a smaller matrix spectrum is placed under periodic boundary conditions, i.e., a cyclic list of eigenvalues, simply repeating them up to $N$ eigenvalues (see the sketch below). </p><p><b>Layer Matrices</b> The set of learned weight matrices up to a given layer in a deep learning architecture. Convolutional layers are mapped into a matrix, i.e., stacked up. </p><p><b>Cascading Periodic Spectral Ergodicity (cPSE)</b> PSE measured in a feedforward manner over a deep neural network; the ensemble is taken to be the layer matrices up to that layer. </p><p><b>Circular Spectral Deviation (CSD)</b> A measure of fluctuations in spectral density between two ensembles.</p><p><b>Matrix Ensemble Equivalence </b>If the CSDs vanish for conjugate MMEs, the ensembles are said to be equivalent.</p>
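<div style="text-align: left;">The periodic boundary treatment in the PSE definition amounts to cyclically repeating a shorter eigenvalue list; a tiny sketch (the helper name is hypothetical, for illustration only):</div><div style="text-align: left;"><br /></div><pre style="font-family: courier;">
import numpy as np

def periodic_pad(eigenvalues, N):
    """Cyclically repeat a shorter eigenvalue list up to length N,
    mimicking the periodic boundary conditions in the PSE definition."""
    reps = int(np.ceil(N / len(eigenvalues)))
    return np.tile(eigenvalues, reps)[:N]

# e.g. a 3-eigenvalue spectrum padded to length 8: [1. 2. 3. 1. 2. 3. 1. 2.]
print(periodic_pad(np.array([1.0, 2.0, 3.0]), 8))
</pre>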
inherit;"><br /></span></p><p style="font-stretch: normal; line-height: normal; margin: 0px;"><span style="font-family: courier;">d_layers</span><span style="font-family: inherit;"> is decreasing vector, it will saturate at some point, that point is where adding more</span></p><p style="font-stretch: normal; line-height: normal; margin: 0px;"><span style="font-family: inherit;">layers won’t improve the performance. This is data, learning or architecture independent measure.</span></p><p style="font-stretch: normal; line-height: normal; margin: 0px;"><span style="font-family: inherit;">Only a French word can explain the excitement here: <b>Voilà!</b></span></p><div><b><br /></b></div><p><br /></p><p><br /></p><p><br /></p>msuzenhttp://www.blogger.com/profile/06434797231632063088noreply@blogger.com0tag:blogger.com,1999:blog-4550553973032503669.post-84581493213449040662021-04-23T12:25:00.007-07:002023-09-16T06:31:15.682-07:00On the fallacy of replacing physical laws with machine-learned inference systems<p><b><span style="font-family: inherit;">Preamble</span></b></p><p style="text-align: justify;"><span style="font-family: inherit;">Progress in machine learning, specifically so-called <a href="https://www.deeplearningbook.org">deep learning</a>, last decade was astonishingly successful in many areas from <a href="https://en.wikipedia.org/wiki/Convolutional_neural_network">computer vision</a> to <a href="https://en.wikipedia.org/wiki/GPT-3">natural language translation </a>reaching automation close to human-level performance in narrow areas, so-called narrow artificial intelligence. At the same time, the scientific and academic communities also joined in applying deep learning in physics and in general physical sciences. If this is used as an assistance to known techniques, it is really good progress, such as drug discovery, accelerating molecular simulations or astrophysical discoveries to understand the universe. However, unfortunately, it is now almost standard claim that one supposedly could replace physical laws with deep learning models: we criticise these claims in general without naming any of our colleagues or works. </span></p><p><b><span style="font-family: inherit;">Circular reasoning: Usage of data produced by known physics </span></b></p><p></p><table cellpadding="0" cellspacing="0" class="tr-caption-container" style="float: right;"><tbody><tr><td style="text-align: center;"><a href="https://upload.wikimedia.org/wikipedia/commons/thumb/4/45/Blind_monks_examining_an_elephant.jpg/1920px-Blind_monks_examining_an_elephant.jpg" style="clear: right; margin-bottom: 1em; margin-left: auto; margin-right: auto;"><span style="font-family: inherit;"><img border="0" data-original-height="580" data-original-width="800" height="290" src="https://upload.wikimedia.org/wikipedia/commons/thumb/4/45/Blind_monks_examining_an_elephant.jpg/1920px-Blind_monks_examining_an_elephant.jpg" width="400" /></span></a></td></tr><tr><td class="tr-caption" style="text-align: center;"><span style="font-family: inherit;">Blind monks examining an elephant <br />(Wikipedia)</span></td></tr></tbody></table><span style="font-family: inherit;"><br />The primary fallacy on papers claiming to be able to produce a learning system that can actually produce physical laws or replace physics with a deep learning system lies in how these systems are trained. Regardless of how good they are in predictions, their primary ability is the product of already known laws. 
They would only replicate the laws encoded in datasets that were generated by those physical laws. </p><p><b>Faulty generalisation: Computational acceleration in narrow applications mistaken for replacing laws</b></p><p>One of the major <a href="https://en.wikipedia.org/wiki/Faulty_generalization">faults</a> in concluding that a machine-learned inference system does better than a physical law is the faulty generalisation of computational acceleration in narrow application areas. This computational acceleration cannot be generalised to the whole parameter space, since the systems are usually trained on data that physical laws generated in a restricted region of parameter space; examples include solving <a href="https://en.wikipedia.org/wiki/N-body_problem">N-body problems</a>, dynamics at any scale derived from the <a href="https://en.wikipedia.org/wiki/Action_(physics)">action</a> or a Lagrangian, and generating fundamental particle physics Lagrangians.</p><p><b>Benefits: Causality still requires a scientist</b></p><p>The intention of this short article is to show the limitations of using machine-learned inference systems in discovering scientific laws. There are, of course, benefits to leveraging machine learning and data science techniques in the physical sciences: accelerating simulations in narrow specialised areas, automating tasks, and assisting scientists in cumbersome validations, such as searching and translating between two domains, especially in medicine and astrophysics, for example sorting images of galaxy formations. However, the results still need a skilled physicist or scientist to really understand them and to form a judgment about a scientific law or discovery, i.e., <a href="https://en.wikipedia.org/wiki/Judea_Pearl">establishing causality</a>. </p><p><b>Conclusion : No automated physicist or automated scientific discovery</b></p><p><a href="https://en.wikipedia.org/wiki/Artificial_general_intelligence">Artificial general intelligence</a> has not yet been founded or achieved. It is for the benefit of the physical sciences that researchers do not claim to have found a deep learning system that can replace physical laws in supervised or semi-supervised settings, but rather concentrate on applications that benefit both theoretical and applied advancement in a down-to-earth fashion. Similarly, funding agencies should be more reasonable and avoid funding such claims.</p><p>In summary, if datasets are produced by known physical laws or mathematical principles, the new deep learning system only replicates what was already known; it is not new knowledge, regardless of how well these systems predict or behave on new inputs. <i>Caution is advised</i>. We cannot yet replace physicists with machine-learned inference systems; in fact, not even <a href="https://en.wikipedia.org/wiki/Radiology">radiologists</a> have been replaced, despite the impressive advancements in computer vision that produce super-human results. </p>
<div style="text-align: left;"><pre style="font-family: courier;">
@misc{suezen21fallacy, 
  title = {On the fallacy of replacing physical laws with machine-learned inference systems}, 
  howpublished = {\url{http://science-memo.blogspot.com/2021/04/on-fallacy-of-replacing-physical-laws.html}}, 
  author = {Mehmet Süzen},
  year = {2021}
}
</pre></div><div style="text-align: left;"><b>Postscripts</b></div><div style="text-align: left;"><br /></div><div style="text-align: left;">The following interpretations and reformulations were curated after the initial post. </div><div style="text-align: left;"><br /></div><div style="text-align: left;"><b>Postscript 1: Regarding symbolic regression</b></div><div style="text-align: left;"><br />There are now multiple claims that one could replace physics with symbolic regression. Yes, symbolic regression is quite a powerful method. However, using raw data produced by physical laws, so-called simulation data from classical mechanics, or modelling experimental data guided by functional forms provided by physics, does not imply that one could replace physics or physical laws with a machine-learned system. 
We have not achieved Artificial General Intelligence (AGI), and symbolic regression is not AGI. Symbolic regression may not even be useful beyond being a verification tool for theory and for numerical solutions of physical laws.</div><div style="text-align: left;"><br /></div><div style="text-align: left;"><b>Postscript 2: Fallacy on the dimensionality reduction and distillation of physical laws with machine learning</b></div><div style="text-align: left;"><br /></div><div style="text-align: left;">There are now multiple claims that one could distill physical dynamical laws with dimensionality reduction. This is indeed a novel approach. However, the core dataset is generated by the coupled set of dynamical equations that is supposed to be reduced, with a fixed set of initial conditions. This does not imply any kind of distillation of the set of original laws, i.e., the procedure cannot be qualified as distilling a set of equations into a smaller number of equations or variates. It only provides an accelerated deployment of dynamical solvers under very specific conditions. 
This includes any renormalisation group dynamics.</div><div style="text-align: left;"><br /></div><div style="text-align: left;"><b>Postscript 3: New terms, Scientific Machine Learning Fallacy and s-PINNs</b></div><div style="text-align: left;"><br /></div><div style="text-align: left;">The usage of symbolic regression with deep learning should be called <i>symbolic physics-informed neural networks (s-PINNs)</i>. Calling these approaches “machine scientist”, “automated scientist”, or “physics law generator” is technically a fallacy, i.e., the Scientific Machine Learning Fallacy, caught up primarily in circular reasoning. </div><div style="text-align: left;"><br /></div><div style="text-align: left;"><b>Postscript 4: AutoML is a misnomer : Scientific Machine Learning (SciML) Fallacy</b></div><div style="text-align: left;"><br /></div><div style="text-align: left;">SciML is immensely promising in providing accelerated deployment of known scientific workflows: specialised areas such as trajectory learning, novel operator solvers, astrophysical image processing, molecular dynamics, and computational applied mathematics in general. Unfortunately, some recent papers continue to jump to claims of automated scientific discovery and of replacing known physical laws with supervised learning systems, including new NLP systems. </div>
<div style="text-align: left;"><br /></div><div style="text-align: left;">The primary fallacy in papers claiming to produce a learning system that can actually generate physical or scientific laws, or replace physics or science with a deep learning system, lies in how these systems are trained. AutoML in this context does not actually replace the scientist; it abstracts former workflows into a different kind of meta-scientific work that assists scientists. Hence it is a misnomer: MetaML is probably the more suitable terminology. </div><p><br /></p>msuzenhttp://www.blogger.com/profile/06434797231632063088noreply@blogger.com0tag:blogger.com,1999:blog-4550553973032503669.post-53680642519364670452021-04-01T10:40:00.007-07:002021-04-26T08:36:14.202-07:00Shifting Modern Data Science Forward: Dijkstra principle for data science<div style="text-align: left;"><h2><span style="font-size: small;"><span style="font-weight: normal;">Kindly reposted to <a href="http://www.kdnuggets.com/">KDnuggets</a> by <a href="https://en.wikipedia.org/wiki/Gregory_Piatetsky-Shapiro">Gregory Piatetsky-Shapiro</a>, with enhancements, under the title <a href="https://www.kdnuggets.com/2021/04/dijkstra-principle-data-science.html">Data science is not about data - applying the Dijkstra principle to data science</a>.</span></span></h2></div><div style="text-align: left;"><b><br /></b></div><div style="text-align: left;"><b>Prelude</b></div><div style="text-align: left;"><table cellpadding="0" cellspacing="0" class="tr-caption-container" style="float: right; margin-left: 1em; text-align: right;"><tbody><tr><td style="text-align: center;"><a href="https://upload.wikimedia.org/wikipedia/commons/c/c9/Edsger_Dijkstra_1994.jpg" style="clear: left; margin-bottom: 1em; margin-left: auto; margin-right: auto;"><img border="0" data-original-height="548" data-original-width="800" height="219" src="https://upload.wikimedia.org/wikipedia/commons/c/c9/Edsger_Dijkstra_1994.jpg" width="320" /></a></td></tr><tr><td class="tr-caption" style="text-align: center;">Dijkstra in Zurich, 1994 (Wikipedia)</td></tr></tbody></table><b><br /></b></div><div><span style="font-family: arial;"><a href="https://en.wikipedia.org/wiki/Edsger_W._Dijkstra">Edsger Dijkstra</a> was a Dutch theoretical physicist turned computer scientist, and probably one of the most influential early pioneers of the field. He had deep insight into what computer science is, and a well-founded notion of how it should be taught in academia. In this post we extrapolate his ideas to data science. 
We develop something we call the <i><u>Dijkstra principle for data science</u></i>, driven by his ideas on what computer science entails.</span></div><h3 style="text-align: left;"><span style="font-size: small;">Computer Science and Astronomy </span></h3><div><span style="font-family: arial;"><a href="https://en.wikipedia.org/wiki/Astronomy">Astronomy</a> is not about telescopes. Indeed, it is about how the universe works and how its constituent parts interact. Telescopes, whether for optical or radio observation, and similar detection techniques, are merely tools for practicing and investigating astronomy. The same analogy carries over to computer science; this is the quote from Dijkstra:</span></div><div><span style="font-family: arial;"><i><blockquote><span>Computer science is no more about computers than astronomy is about telescopes.</span> - <span>Edsger Dijkstra</span></blockquote></i></span></div><div><span style="font-family: arial;">The idea of computer science not being about computers seems rather strange at first. However, what Dijkstra had in mind are the abstract mechanisms and mathematical constructs onto which one can map real problems, solving them as computer science problems, such as <a href="https://en.wikipedia.org/wiki/Graph_theory">graph algorithms</a>. Though computer science has many subfields, its inception can be considered as rooted in <a href="https://en.wikipedia.org/wiki/Applied_mathematics">applied mathematics</a>.</span></div><div><span style="font-family: arial;"><br /></span></div><div><span style="font-family: arial;"><b>Dijkstra principle for data science</b></span></div><div><span style="font-family: arial;"><br /></span></div><div><span style="font-family: arial;">By using Dijkstra's approach, we are now in a position to formulate a principle for data science. </span></div><blockquote><div><i><span style="font-family: arial;">Data science is no more about data than computer science is about computers.</span> <span style="font-family: arial;">- Dijkstra principle for data science</span></i></div></blockquote><div><span style="font-family: arial;">This sounds absurd. If data science is not about data, then what is it about? Apart from the definition of data science as an emergent field, an amalgamation of multiple fields from statistics to high performance computing, the idea that data is not the core tenet of data science implies that the practice does not aim at the data itself but at a higher purpose. Data is used like a telescope in astronomy: the purpose is to reveal empirical truths about the <i>representations</i> the data conveys. There is no unique way to achieve this purpose. </span></div><div><span style="font-family: arial;"><br /></span></div><div><span style="font-family: arial;"><b>Conclusive Remarks</b></span></div><div><br /></div><div><span style="font-family: arial;">The <i>Dijkstra principle for data science</i> is very helpful in understanding data science practice as <i>not data-centric</i>, contrary to the mainstream dogma, but rather as a <i>science-centric</i> practice, with data being the primary tool to leverage using a multitude of techniques. 
The implication is that machine learning is a secondary tool, on top of data, in practicing data science. This attitude would help causality play a major role in shifting modern data science forward.</span></div>msuzenhttp://www.blogger.com/profile/06434797231632063088noreply@blogger.com0tag:blogger.com,1999:blog-4550553973032503669.post-61153577337146822992021-03-20T14:34:00.008-07:002021-03-20T14:42:37.947-07:00 Computable function analogs of natural learning and intelligence may not exist<p><b><br /><span style="font-family: arial;">Optimal learning : Meta-optimization </span></b></p><p><span style="font-family: arial;">Many papers directly equate the “machine” learning problem, algorithmic learning as opposed to human or animal learning, with an optimisation problem. Unfortunately, contrary to common belief, machine learning is not an optimisation problem. For example, take <i>optimal learning strategy</i>, replace learning with optimisation, and at some point we end up with the absurd term <i>optimal optimisation strategy</i>. </span></p><table cellpadding="0" cellspacing="0" class="tr-caption-container" style="float: right; margin-left: 1em; text-align: right;"><tbody><tr><td style="text-align: center;"><a href="https://upload.wikimedia.org/wikipedia/commons/3/3d/Maquina.png" style="clear: right; margin-bottom: 1em; margin-left: auto; margin-right: auto;"><img border="0" data-original-height="444" data-original-width="800" height="178" src="https://upload.wikimedia.org/wikipedia/commons/3/3d/Maquina.png" width="320" /></a></td></tr><tr><td class="tr-caption" style="text-align: center;">Turing machine (Wikipedia)</td></tr></tbody></table><span style="font-family: arial;">It sounds as if machine learning, as practiced, is a meta-optimisation problem, rather than learning as humans do. 
</span><p></p><p><span style="font-family: arial;"><b>Computable functions to learning</b></span></p><p><span style="font-family: arial;">Fundamentally, we do not know how human learning can be mapped into an algorithm, whether there are computable function analogs of human learning, or whether human intelligence and its artificial analogs can be represented in a Turing-computable manner.</span></p>msuzenhttp://www.blogger.com/profile/06434797231632063088noreply@blogger.com0tag:blogger.com,1999:blog-4550553973032503669.post-9089925940002481772021-03-07T14:42:00.002-08:002021-03-07T14:47:30.578-08:00Critical look on why deployed machine learning model performance degrade quickly<div><span style="font-family: arial;"><table cellpadding="0" cellspacing="0" class="tr-caption-container" style="float: right; margin-left: 1em; text-align: right;"><tbody><tr><td style="text-align: center;"><a href="https://upload.wikimedia.org/wikipedia/commons/a/ab/William_of_Ockham_-_Logica_1341.jpg" style="clear: left; margin-bottom: 1em; margin-left: auto; margin-right: auto;"><img border="0" data-original-height="373" data-original-width="400" height="298" src="https://upload.wikimedia.org/wikipedia/commons/a/ab/William_of_Ockham_-_Logica_1341.jpg" title="William of Ockham" width="320" /></a></td></tr><tr><td class="tr-caption" style="text-align: center;">Illustration of William of Ockham <br />(Wikipedia)</td></tr></tbody></table>One of the major problems in using a so-called machine learning model, usually a supervised model, in so-called deployment, meaning that it will serve new data points that were in neither the training nor the test set, is that, to their great astonishment, modellers or data scientists observe that the model's performance degrades quickly, or that it does not perform as well as it did on the test set. We earlier ruled out <a href="http://science-memo.blogspot.com/2020/11/re-discovery-of-inverse-problems-what.html">underspecification as the main cause</a>. Here we propose that the primary reason for such performance degradation lies in relying solely on the hold-out method to judge generalised performance.</span></div><div><span style="font-family: arial;"><br /></span></div><div><span style="font-family: arial;"><b>Why does model test performance not carry over to deployment? Understanding overfitting</b></span></div><div><span style="font-family: arial;"><br /></span></div><div><span style="font-family: arial;">A major contributing factor is the inaccurate meme of overfitting, which actually meant overtraining, and the erroneous linking of overtraining solely to generalisation. This was discussed earlier as <a href="http://memosisland.blogspot.com/2017/08/understanding-overfitting-inaccurate.html">understanding overfitting</a>. Overfitting is not about how good the function approximation is compared to how the same “<i>model</i>” works on other subsets of the dataset. Hence, the hold-out method (test/train) of measuring performance does not provide sufficient and necessary conditions to judge a model's generalisation ability: with this approach we can detect neither overfitting (in the Occam's razor sense) nor the deployment performance. </span></div><div><span style="font-family: arial;"><br /><b>How to mimic deployment performance?</b></span></div><div><span style="font-family: arial;"><b><br /></b>This depends on the use case, but the most promising approaches lie in adaptive analysis and in detecting distribution shifts and building models accordingly (a minimal sketch of such a check follows below). However, the answer to this question is still an open research question.</span></div>
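<div><span style="font-family: arial;">As an illustration of such a check, here is a minimal Python sketch comparing a feature's training distribution with its live values via a two-sample Kolmogorov-Smirnov test (the feature values and the threshold are hypothetical; this is one simple screen among many possible ones):</span></div><div><br /></div><pre style="font-family: courier;">
import numpy as np
from scipy.stats import ks_2samp

# Hypothetical data: the same feature at training time and in deployment.
rng = np.random.default_rng(7)
train_feature = rng.normal(0.0, 1.0, size=5000)
live_feature = rng.normal(0.3, 1.0, size=1000)  # shifted mean in deployment

# A small p-value hints that the live distribution drifted from training.
stat, p_value = ks_2samp(train_feature, live_feature)
if p_value < 0.01:
    print(f"possible distribution shift: KS={stat:.3f}, p={p_value:.2e}")
</pre>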
msuzenhttp://www.blogger.com/profile/06434797231632063088noreply@blogger.com0tag:blogger.com,1999:blog-4550553973032503669.post-57441826381717230512020-12-27T22:12:00.011-08:002023-08-13T11:25:02.376-07:00Statistical Physics Origins of Connectionist Learning: Cooperative Phenomenon to Ising-Lenz Architectures<div style="text-align: left;"><span style="font-family: arial;"><i>This is an informal essay aiming at raising awareness that statistical physics played a foundational role in deep learning and neural networks in general: beyond being a mere analogy, it is <b>their origin</b>. </i></span></div><div style="text-align: left;"><br /></div><div style="text-align: left;"><span style="font-weight: normal;">An article version of this post is available here: <a href="http://dx.doi.org/10.13140/RG.2.2.13632.40962">doi</a> and on <a href="https://hal.archives-ouvertes.fr/hal-03650339/document">HAL Open Science</a>.</span></div><h3 style="text-align: left;">Preamble</h3><div style="text-align: left;"><p style="text-align: justify;"><span style="font-family: arial;">A short account of the origins of the mathematical formalism of neural networks is presented informally, in a basic discrete mathematical setting, for physicists and computer scientists. The mathematical formalisms for the dynamics of lattice models in statistical physics and for learning internal representations in neural networks as discrete architectures evolved as quantitative tools in two almost distinct fields for more than half a century, with limited overlap. We aim at bridging the gap by claiming that the analogy between the two approaches is not artificial but naturally occurring, due to how the modelling of cooperative phenomena is constructed. We define the <i>Lenz-Ising architectures (ILAs)</i> for this purpose.</span></p><h3 style="text-align: left;"><span style="font-family: arial;">Introduction</span></h3><table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="float: right; margin-left: 1em; text-align: right;"><tbody><tr><td style="text-align: center;"><a href="https://upload.wikimedia.org/wikipedia/commons/7/7d/Ising-tartan.png" style="clear: left; margin-bottom: 1em; margin-left: auto; margin-right: auto;"><img alt="Tartan Ising Model" border="0" data-original-height="800" data-original-width="800" height="320" src="https://upload.wikimedia.org/wikipedia/commons/7/7d/Ising-tartan.png" title="Tartan Ising Model" width="320" /></a></td></tr><tr><td class="tr-caption" style="text-align: center;">Figure: Tartan Ising Model <br />(Linas Viptas-Wikipedia)</td></tr></tbody></table><div style="text-align: left;"><span style="font-family: arial;">Understanding natural or artificial phenomena in the language of discrete mathematics is probably one of the most powerful toolboxes scientists use [1]. 
A large portion of computer science and statistical physics deals with such finite structures. One of the most prominent successful usages of this approach was Lenz and Ising's work on modelling ferromagnetic materials [2–5], and neural networks as a model of biological neuronal structures [6–8].</span></div><div style="text-align: left;"><span style="font-family: arial;"><br /></span></div><div style="text-align: left;"><span style="font-family: arial;">The analogy between the two distinct areas of research has been pointed out by many researchers [9–13]. However, the discourses evolved in two distinct research fields, and many innovative approaches were rediscovered under different names.</span></div><h3 style="text-align: left;"><span style="font-family: arial;">Cooperative Phenomenon</span></h3><div style="text-align: left;"><span style="font-family: arial;">The statistical definition of cooperative phenomena was pioneered by Kramers and Wannier [14–16]. Even though their technical work focused on extending the Ising model to 2D with cyclic boundary conditions and on introducing exact solutions with matrix algebra, they were the first to document how the Lenz-Ising model actually represents a far more generic system than merely a model of ferromagnets: anything that falls under cooperative phenomena can be addressed with a Lenz-Ising type model, as summarised in Definition 1.</span></div>
<p style="text-align: left;"><span style="font-family: arial;"><b>Definition 1</b>: <b><i>Cooperative phenomenon of Wannier type</i></b> [14]: A set of $N$ discrete units, $\mathscr{U}$, each identified with a function $s_{i}$, $i=1,\dots,N$, forms a collection or assembly. The function that identifies a unit is a mapping $s_{i}: \mathbb{R} \rightarrow \mathbb{R}$. A statistic $\mathscr{S}$ applied on $\mathscr{U}$ is called a <i>cooperative phenomenon of Wannier type</i> $\mathscr{W}$.</span></p>
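<div style="text-align: left;"><span style="font-family: arial;">As a concrete, purely illustrative reading of Definition 1 (the choice of spin-valued units and of the mean as the statistic $\mathscr{S}$ is an assumption of the example):</span></div><div><br /></div><pre style="font-family: courier;">
import numpy as np

# An assembly U of N discrete units s_i, here spin-valued in {-1, +1}.
rng = np.random.default_rng(1)
N = 100
units = rng.choice([-1, 1], size=N)

# A statistic S applied on U; here the mean, i.e. the magnetisation per unit.
statistic = units.mean()
print(f"magnetisation per unit = {statistic:.3f}")
</pre>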
<div style="text-align: left;"><span style="font-family: arial;">A statistic $\mathscr{S}$ can be any mapping or set of operations on the assembly of units $\mathscr{U}$. For example, inducing an ordering on the assembly of units and summing over the $s_{i}$ values would correspond to a non-interacting magnetic system in a unit external field, or to a non-connected set of neurons with the capacity for inhibition or excitation. However, amazingly, Definition 1 is so generic that Rosenblatt's perceptron [17], current deep learning systems [18] and complex networks [19] fall into this category as well. </span></div><div style="text-align: left;"><br /></div><div style="text-align: left;"><span style="font-family: arial;">The originality of the cooperative phenomenon of Wannier type comes with a secondary concept, so-called event propagation, as given in Definition 2.</span></div><p><span style="font-family: arial;"><b>Definition 2. Event propagation</b> [14]: An event is defined as a snapshot of a cooperative phenomenon of Wannier type $\mathscr{W}$. If an event takes place at one unit of the assembly $\mathscr{U}$, the same event will be favoured by other units. This is expressed as an event propagation $\mathscr{E}(u_{1}, u_{2})$ between two disjoint sets of units, $u_{1}, u_{2} \subset \mathscr{U}$ with $u_{1} \cap u_{2} = \varnothing$, together with an additional statistic $\mathscr{S}$.</span></p><p><span style="font-family: arial;">The parallels between Wannier's event propagations and the neural network formalism defined by McCulloch-Pitts-Kleene [6,7] are remarkable: not only conceptually, the mathematical treatment is identical and originates from the Lenz-Ising model's treatment of discrete units. As we mentioned, this is beyond doubt not a simple analogy but a generic framework, as envisioned by Wannier. The similarity between ferromagnetic systems and neural networks was probably first documented directly by Little [8]: the states of magnetic spins correspond to the firing states of a neuron. Unfortunately, Little saw it only as a simple analogy, and missed the opportunity provided by Wannier's view of cooperation as a generic natural phenomenon.</span></p><p><span style="font-family: arial;">The conceptual similarity and inference in Wannier's event propagation appear to be quite close to Hebb's learning [20], and give a natural justification for backpropagation in multilayered networks. The history of backpropagation is exhaustively studied elsewhere [18].</span></p><h3 style="text-align: left;"><span style="font-family: arial;">Lenz-Ising Architectures (ILAs): Ferromagnets to Nerve Nets</span></h3><div style="text-align: left;"><table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="float: left; margin-right: 1em; text-align: left;"><tbody><tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEh4lLwxoRVeTqwY6yXyfl7lFWhTPjMBEGgIEdDPEBE75jqbFw8UtxfOp-HjVHugNMAJazDo5v0K6VGc_9sF-o4P6Z6Udtdd2RvU_4OqdFcv-I_qtqeGPnfgIhcEFICHCCXD_cg4fkqebnrv/s1002/ising.png" style="margin-left: auto; margin-right: auto;"><img border="0" data-original-height="1002" data-original-width="722" height="320" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEh4lLwxoRVeTqwY6yXyfl7lFWhTPjMBEGgIEdDPEBE75jqbFw8UtxfOp-HjVHugNMAJazDo5v0K6VGc_9sF-o4P6Z6Udtdd2RvU_4OqdFcv-I_qtqeGPnfgIhcEFICHCCXD_cg4fkqebnrv/w231-h320/ising.png" width="231" /></a></td></tr><tr><td class="tr-caption" style="text-align: center;">Ernst Ising<br /> <span style="font-size: x-small;">Image owner APS - Physics Today : <br />Obituary</span></td></tr></tbody></table></div>
<div style="text-align: left;"><span style="font-family: arial;">As we have established the two basic definitions of cooperative phenomena, we can now define a generic setting of the Lenz-Ising model that captures both the physics literature, which used it extensively in so-called spin-glass research, and neural networks. The guiding principle is Wannier's definition of the cooperative phenomenon.</span></div><div style="text-align: left;"><br /></div>
<p style="text-align: left;"><span style="font-family: arial;"><b>Definition</b>: <b>Lenz-Ising Architectures (ILAs)</b> <br />Given a Wannier-type cooperative phenomenon $\mathscr{W}$, impose the constraint on the discrete units, $\mathscr{U}^{c}$, that they are spatially ordered on the vertices $V$ of an arbitrary graph $\mathscr{G}(V, E)$, with the edges $E$ of the graph carrying the coupling weights between connected pairs of units, together with biases. A set of event propagations $\mathscr{E}^{c}$ defined on the cooperative phenomenon can induce dynamics that define the coupling weights, or vice versa. ILAs are defined as a statistic $\mathscr{S}$ applied to $\mathscr{U}^{c}$ with propagations $\mathscr{E}^{c}$. </span></p><p><span style="font-family: arial;">Lenz-Ising Architectures (ILAs) should not be confused with graph neural networks, as they do not model data structures. They could be seen as a subset of graph dynamical systems in some sense, but formal connections should be established elsewhere. The primary characteristic of ILAs is that they are a conceptual and mathematical representation of spin-glass systems (including Lenz-Ising, Anderson, Sherrington-Kirkpatrick and Potts systems) and neural networks (including recurrent and convolutional networks) under the same umbrella.</span></p><h3 style="text-align: left;"><span style="font-family: arial;">Learning representations inherent in Metropolis-Glauber dynamics</span></h3><p><span style="font-family: arial;">The primary originality in any neural network research paper lies in so-called learning of representations from data, and in generalisation. However, it is not obvious to that community that spin-glasses are, by construction, inherently capable of learning representations through induced dynamics such as Metropolis or Glauber dynamics, as an inverse problem.</span></p><p><span style="font-family: arial;">In the physics literature this appears as the problem of how to express the free energy and minimise it with respect to the weights, or coupling coefficients. This is nothing but learning representations. Usually a simulation approach is taken as the route, for example Monte Carlo techniques [5, 21, 22] via Metropolis or Glauber dynamics; a minimal sketch follows below. The intimate connection between the concepts of ergodicity and learning in deep learning has recently been shown [13,23,24] in this context.</span></p><table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="float: right; margin-left: 1em; text-align: right;"><tbody><tr><td style="text-align: center;"><a href="https://upload.wikimedia.org/wikipedia/commons/7/75/Roy_Glauber_Dec_10_2005.jpg" style="margin-left: auto; margin-right: auto;"><img border="0" data-original-height="466" data-original-width="410" height="200" src="https://upload.wikimedia.org/wikipedia/commons/7/75/Roy_Glauber_Dec_10_2005.jpg" width="176" /></a></td></tr><tr><td class="tr-caption" style="text-align: center;">Roy J. Glauber (Wikipedia) <br />Glauber dynamics</td></tr></tbody></table><p><span style="font-family: arial;">As we argued with the generic definition provided by Wannier on cooperative phenomena and with ILAs, there is an intimate connection between learning and the so-called solving of spin-glasses, which usually boils down to computing free energies. A link between the two distinct fields, computing backpropagation and computing free energies, is a natural candidate for establishing equivalence relations.</span></p>
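<div style="text-align: left;"><span style="font-family: arial;">To make the induced dynamics concrete, here is a minimal Metropolis single-spin-flip sketch on a one-dimensional Lenz-Ising chain with periodic boundaries (the parameters are illustrative and not taken from the references):</span></div><div><br /></div><pre style="font-family: courier;">
import numpy as np

# 1D Lenz-Ising chain: coupling J, inverse temperature beta, N spins.
rng = np.random.default_rng(3)
N, J, beta, sweeps = 64, 1.0, 0.8, 1000
spins = rng.choice([-1, 1], size=N)

for _ in range(sweeps * N):
    i = rng.integers(N)
    # Energy change of flipping spin i against its two neighbours (periodic).
    dE = 2.0 * J * spins[i] * (spins[(i - 1) % N] + spins[(i + 1) % N])
    # Metropolis acceptance rule.
    if dE <= 0 or rng.random() < np.exp(-beta * dE):
        spins[i] = -spins[i]

print("magnetisation per spin:", spins.mean())
</pre>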
<h3 style="text-align: left;"><span style="font-family: arial;"><b>Conclusions and Outlook</b></span></h3><p><span style="font-family: arial;">Apart from honouring the physicists Lenz and Ising, and based on an understanding of the origins of cooperative phenomena, naming the research outputs of spin-glasses and neural networks under the umbrella term Lenz-Ising architectures (ILAs) is historically accurate and technically a reasonable naming scheme, given the overwhelming evidence in the literature. This is akin to naming current computers von Neumann architectures. This constitutes the statistical physics origin of connectionist learning, an approach currently enjoying vast engineering success.</span></p><p><span style="font-family: arial;">The rich connection between the two areas, computer science and statistical physics, should be celebrated. For more fruitful collaborations, both literatures, embracing the large statistics literature as well, should converge much more closely. This would help the communities avoid the awkward situation of reinventing the wheel, and of hindering recognition of work done by physicists decades earlier, i.e., Ising and Lenz.</span></p><h3 style="text-align: left;"><span style="font-family: arial; font-size: small;">Notes</span></h3><p><span style="font-family: arial;">No competing or other kind of conflict of interest exists. This work is produced solely as scholarly work and is not of a personal nature at all. This essay is dedicated to the memory of <a href="https://en.wikipedia.org/wiki/Ernst_Ising">Ernst Ising</a> for his contribution to the physics of ferromagnetic materials, which now appears to have far wider implications.</span></p><p><span style="font-family: arial;"><b>References</b></span></p><p><span style="font-family: arial;">[1] Kenneth H Rosen. Handbook of Discrete and Combinatorial Mathematics. CRC Press, 1999.</span></p><p><span style="font-family: arial;">[2] W. Lenz. Beitrag zum Verständnis der magnetischen Erscheinungen in festen Körpern. Phys. Z., 21:613, 1920.</span></p><p><span style="font-family: arial;">[3] Ernst Ising. Beitrag zur Theorie des Ferromagnetismus. Zeitschrift für Physik, 31(1):253–258, 1925.</span></p><p><span style="font-family: arial;">[4] Thomas Ising, Reinhard Folk, Ralph Kenna, Bertrand Berche, and Yurij Holovatch. The fate of Ernst Ising and the fate of his model. arXiv preprint arXiv:1706.01764, 2017.</span></p><p><span style="font-family: arial;">[5] David P Landau and Kurt Binder. A Guide to Monte Carlo Simulations in Statistical Physics. Cambridge University Press, 2014.</span></p><p><span style="font-family: arial;">[6] W.S. McCulloch and W.H. Pitts. A Logical Calculus of the Ideas Immanent in Nervous Activity. Bull. Math. Biophys., (5), pages 115–133, 1943.</span></p><p><span style="font-family: arial;">[7] Stephen Cole Kleene. Representation of Events in Nerve Nets and Finite Automata. Technical report, RAND Project, Santa Monica, 1951.</span></p><p><span style="font-family: arial;">[8] W. A. Little. The Existence of Persistent States in the Brain. Mathematical Biosciences, 19(1-2):101–120, 1974.</span></p><p><span style="font-family: arial;">[9] P Peretto. Collective Properties of Neural Networks: a Statistical Physics Approach. Biological Cybernetics, 50(1):51–62, 1984.</span></p><p><span style="font-family: arial;">[10] Jan L van Hemmen. 
Spin-glass Models of a Neural Network. Physical Review A, 34(4):3435, 1986.</span></p><p><span style="font-family: arial;">[11] Haim Sompolinsky. Statistical Mechanics of Neural Networks. Physics Today, 41(12):70–80, 1988.</span></p><p><span style="font-family: arial;">[12] David Sherrington. Neural Networks: the Spin Glass Approach. In North-Holland Mathematical Library, volume 51, pages 261–291. Elsevier, 1993.</span></p><p><span style="font-family: arial;">[13] Yasaman Bahri, Jonathan Kadmon, Jeffrey Pennington, Sam S. Schoenholz, Jascha Sohl-Dickstein, and Surya Ganguli. Statistical Mechanics of Deep Learning. Annual Review of Condensed Matter Physics, 2020.</span></p><p><span style="font-family: arial;">[14] Gregory H. Wannier. The Statistical Problem in Cooperative Phenomena. Reviews of Modern Physics, 17(1):50, 1945.</span></p><p><span style="font-family: arial;">[15] Hendrik A. Kramers and Gregory H. Wannier. Statistics of the Two-Dimensional Ferromagnet. Part I. Physical Review, 60(3):252, 1941.</span></p><p><span style="font-family: arial;">[16] Hendrik A. Kramers and Gregory H. Wannier. Statistics of the Two-Dimensional Ferromagnet. Part II. Physical Review, 60(3):263, 1941.</span></p><p><span style="font-family: arial;">[17] C. van der Malsburg. Frank Rosenblatt: Principles of Neurodynamics: Perceptrons and the Theory of Brain Mechanisms. In Brain Theory, pages 245–248. Springer, 1986.</span></p><p style="text-align: left;"><span style="font-family: arial;">[18] J. Schmidhuber. Deep Learning in Neural Networks: An Overview. Neural Networks, 61:85–117, 2015; and Yoshua Bengio, Yann LeCun, and Geoffrey Hinton. Deep Learning for AI. Communications of the ACM, 64(7):58–65, 2021. <a href="https://cacm.acm.org/magazines/2021/7/253464-deep-learning-for-ai/fulltext">link</a></span></p><p><span style="font-family: arial;">[19] Duncan J. Watts and Steven H. Strogatz. Collective Dynamics of 'Small-World' Networks. Nature, 393(6684):440, 1998.</span></p><p><span style="font-family: arial;">[20] Donald Olding Hebb. The Organization of Behavior: a Neuropsychological Theory. J. Wiley; Chapman & Hall, 1949.</span></p><p><span style="font-family: arial;">[21] Mehmet Suezen. Effective ergodicity in single-spin-flip dynamics. 
Physical Review E, 90(3):032141, 2014.</span></p><p><span style="font-family: arial;">[22] Mehmet Suezen. Anomalous diffusion in convergence to effective ergodicity. arXiv preprint arXiv:1606.08693, 2016.</span></p><p><span style="font-family: arial;">[23] Mehmet Suezen, Cornelius Weber, and Joan J. Cerda. Spectral ergodicity in deep learning architectures via surrogate random matrices. arXiv preprint arXiv:1704.08303, 2017.</span></p><p><span style="font-family: arial;">[24] Mehmet Suezen, J. J. Cerda, and Cornelius Weber. Periodic Spectral Ergodicity: A Complexity Measure for Deep Neural Networks and Neural Architecture Search. arXiv preprint arXiv:1911.07831, 2019.</span></p><p><br /></p><p><span style="font-family: arial;"><b>Postscript 1: (Deep) machine learning as a subfield of statistical physics</b></span></p><p><span style="font-family: arial;">Researchers often place some machine learning methods under different umbrella terms compared to established statistical physics. However, beyond being a mere analogy, the application of these methods is quite striking. Consequently, there is a great tradition of machine learning practice being a sub-field of statistical physics, with an explicit classification within PACS. Some of the correspondences:</span></p><p><span style="font-family: arial;">Hopfield Networks <- Ising-Lenz model<br />Boltzmann Machines <- Sherrington-Kirkpatrick model<br />Diffusion Models <- Langevin dynamics, Fokker-Planck dynamics<br />Softmax <- Boltzmann-Gibbs connection to the partition function<br />Energy Based Models <- Spin-glasses, Hamiltonian dynamics</span></p>
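<p><span style="font-family: arial;">To make the softmax correspondence concrete, here is a minimal sketch (our own illustrative addition): identifying the logits z with negative energies, E = -z, softmax is exactly a Boltzmann-Gibbs distribution normalised by the partition function, with the temperature T an assumed hyper-parameter.</span></p><pre>import numpy as np

def softmax(z, T=1.0):
    """Softmax as a Boltzmann-Gibbs distribution: with energies E = -z,
    p_i = exp(-E_i / T) / Z, where Z = sum_j exp(-E_j / T) is the
    partition function."""
    w = np.exp((z - z.max()) / T)  # shift by max(z) for numerical stability
    return w / w.sum()             # normalise by the partition function Z

z = np.array([2.0, 1.0, 0.1])
print(softmax(z))          # ordinary softmax, T = 1
print(softmax(z, T=0.1))   # low temperature: the distribution sharpens</pre>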
<p><span style="font-family: arial;">For this reason, we provide semi-formal mathematical definitions in the recent article, establishing that deep learning architectures should be called Ising-Lenz Architectures (ILAs), akin to calling current computers von Neumann architectures.</span></p>msuzenhttp://www.blogger.com/profile/06434797231632063088noreply@blogger.com0tag:blogger.com,1999:blog-4550553973032503669.post-17825369078512131552020-12-03T08:25:00.001-08:002020-12-04T11:47:32.830-08:00Resolution of the dilemma in explainable Artificial Intelligence: Who is going to explain the explainer?<p></p><table cellpadding="0" cellspacing="0" class="tr-caption-container" style="float: right; margin-left: 1em; text-align: center;"><tbody><tr><td style="text-align: center;"><a href="https://upload.wikimedia.org/wikipedia/commons/thumb/d/d4/Infinite_regress_en.svg/440px-Infinite_regress_en.svg.png" style="margin-left: auto; margin-right: auto;"><img alt="Infinite Regress" border="0" data-original-height="709" data-original-width="440" height="320" src="https://upload.wikimedia.org/wikipedia/commons/thumb/d/d4/Infinite_regress_en.svg/440px-Infinite_regress_en.svg.png" title="Infinite Regress" width="198" /></a></td></tr><tr><td class="tr-caption" style="text-align: center;">Figure: Infinite<br />Regress (Wikipedia)</td></tr></tbody></table><span style="font-family: arial;"><div style="text-align: justify;"><b>Preamble</b></div><div style="text-align: justify;"><br /></div><div style="text-align: justify;">The usage of artificial intelligence (AI) systems has surged and is now standard practice for mid- to large-scale industries. 
These systems cannot reason by construction, and <a href="https://en.wikipedia.org/wiki/General_Data_Protection_Regulation">the legal requirements</a> dictate that if a machine learning/AI model made a decision, such as granting a loan or not, the people affected by that decision have the right to know <i>the reason</i>. However, it is well known that machine learning models cannot reason or provide reasoning out of the box. Apart from the research exercise of modifying conventional machine learning systems to include some form of reasoning, practicing or building so-called explainable or interpretable machine learning solutions on top of conventional models is very popular. Though there is no accepted definition of what an explanation of a machine learning system should entail, in general this field of study is called <a href="https://en.wikipedia.org/wiki/Explainable_artificial_intelligence">explainable artificial intelligence</a>.</div><div style="text-align: justify;"><br /></div><div style="text-align: justify;">One of the most used or popularised sets of techniques essentially builds a secondary model on top of the primary model's behaviour and tries to come up with a story of how the primary model, the AI system, arrived at its answers. Although this approach sounds like a good solution at first glance, it actually traps us in an infinite regress, a dilemma: <i>Who is going to explain the explainer?</i></div><div style="text-align: justify;"><br /></div><div style="text-align: justify;"><b>Avoiding the 'Who is going to explain the explainer?' dilemma</b></div><div style="text-align: justify;"><br /></div><div style="text-align: justify;">The resolution lies in completely avoiding explainer models, or techniques that rely on optimisations of a similar sort. We should rely solely on so-called <i><b>counterfactual generators</b></i>. 
These generators rely on repetitive queries to the system, generating data on the behaviour of the AI system in order to answer <b>what-if</b> scenarios, or a set of what-if scenarios corresponding to a set of <i>reasoning statements</i>.</div><div style="text-align: justify;"><br /></div><div style="text-align: justify;"><b>What are counterfactual generators?</b></div><div style="text-align: justify;"><br /></div><div style="text-align: justify;"><table cellpadding="0" cellspacing="0" class="tr-caption-container" style="float: right; margin-left: 1em; text-align: right;"><tbody><tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjv1erqYeqhaSY3FCGIN1bUdLEDrOyxtVzlEcq-VYPC4hnS2iz0vQYui2yndOW-GDx72zoQMY2O83ir-aog5l2RrsAP1PMK-d9ah17FCXVdikRvOiGJPn_bJGw_KsQAUDu9tRjTEYuyZy5w/s1322/Screenshot+2020-12-04+at+20.44.24.png" imageanchor="1" style="clear: left; margin-bottom: 1em; margin-left: auto; margin-right: auto;"><img border="0" data-original-height="816" data-original-width="1322" height="198" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjv1erqYeqhaSY3FCGIN1bUdLEDrOyxtVzlEcq-VYPC4hnS2iz0vQYui2yndOW-GDx72zoQMY2O83ir-aog5l2RrsAP1PMK-d9ah17FCXVdikRvOiGJPn_bJGw_KsQAUDu9tRjTEYuyZy5w/w320-h198/Screenshot+2020-12-04+at+20.44.24.png" width="320" /></a></td></tr><tr><td class="tr-caption" style="text-align: center;">Figure: Counterfactual generator,<br />instance based.</td></tr></tbody></table>These are techniques that can generate a counterfactual statement about a predicted machine learning decision. For example, for a loan approval model, a counterfactual statement would be "<i>If the applicant's income were 10K more, the model would have approved the loan</i>". The simplest form of counterfactual generator one can think of is Individual Conditional Expectation (ICE) curves <i>[ Goldstein2013 ]</i>: an ICE curve shows what would happen to the model decision if one of the features, such as income, varied over a set of values; a minimal sketch is given below. The idea is simple, but it is so powerful that one can generate a dataset for counterfactual reasoning, hence the name counterfactual generator. These are classified as model-agnostic methods in general <i>[ Du2020, Molnar ]</i>, but the distinction we are trying to make here is that we avoid building another model to explain the primary model; we rely solely on queries to the model.</div>
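<div style="text-align: left;">A minimal sketch of such a query-only ICE-style counterfactual generator follows (an illustrative addition: the logistic model, the synthetic loan data and the income grid are all assumptions; any opaque model exposing a prediction function would do).</div><pre>import numpy as np
from sklearn.linear_model import LogisticRegression

# Illustrative training data: columns are (income in K, years employed).
rng = np.random.default_rng(0)
X = rng.normal(loc=[50.0, 5.0], scale=[15.0, 3.0], size=(500, 2))
y = (X[:, 0] + 5.0 * X[:, 1] + rng.normal(0, 10, 500) > 80).astype(int)
model = LogisticRegression().fit(X, y)  # stands in for any opaque model

def ice_curve(model, x, feature, grid):
    """Query-only ICE curve: vary one feature of a single instance x over a
    grid of values and record the model's approval probability. No secondary
    model is built; we only query the primary model."""
    queries = np.tile(x, (len(grid), 1))
    queries[:, feature] = grid
    return model.predict_proba(queries)[:, 1]

applicant = np.array([45.0, 4.0])   # a single loan applicant
grid = np.linspace(30.0, 90.0, 7)   # counterfactual incomes to query
for income, p in zip(grid, ice_curve(model, applicant, 0, grid)):
    print(f"income {income:5.1f}K -> approval probability {p:.2f}")</pre><div style="text-align: left;">Reading the printed curve off directly yields counterfactual statements of the form "had the income been X, the approval probability would have been p", with no secondary explainer model involved.</div>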
</span><span style="background-color: white; text-align: start;"><span style="color: #222222;">This rules out LIME, as it relies on building models to explain the model, we question that if linear regression is intrinsically explainable here </span></span><i style="background-color: white; caret-color: rgb(34, 34, 34); color: #222222; text-align: start;">[Lipton]</i><span style="background-color: white; caret-color: rgb(34, 34, 34); color: #222222; text-align: start;">. One </span><span style="color: #222222;">extension to ICE is generating a falling list [ wang14 ] outputs without building models.</span></div><div style="text-align: justify;"><span style="color: #222222;"><span style="caret-color: rgb(34, 34, 34);">. </span></span></div><div style="text-align: start;"><span style="color: #222222;"><span style="background-color: white; caret-color: rgb(34, 34, 34);"> </span></span></div><div style="text-align: justify;"><b>Outlook</b></div><div style="text-align: justify;"><b><br /></b></div><div style="text-align: left;">We rule out of using secondary machine learning models or any models, including simple linear regression, in building an explanation for machine learning system. Instead we claim that <i>reasoning</i> can be achieved a simplest level with <b>counterfactual generators</b> based on systems behaviour to different query sets. This seems to be a good direction, as reasoning can be defined as "<span style="font-style: italic; text-align: left;">algebraically manipulating previously acquired knowledge in order to answer a new question</span>" by <span style="text-align: left;"><a href="https://leon.bottou.org">Léon Botton</a> <i>[ Botton ]</i> and of course partly inline with <a href="https://amturing.acm.org/award_winners/pearl_2658896.cfm">Judea Pearl's causal inference revolution</a>, though replacing the machine learning model with the causal model completely would be more causal inference recommendation.</span></div><div style="text-align: justify;"><b><br /></b></div><div style="text-align: justify;"><b>References and further reading</b></div><div style="text-align: justify;"><b><br /></b></div><div style="text-align: justify;"><span style="background-color: white; caret-color: rgb(34, 34, 34); color: #222222; text-align: start;">[ Goldstein2013 ] </span><span style="text-align: left;">Peeking Inside the Black Box: Visualising Statistical Learning with Plots of Individual Conditional Expectation, Goldstein et. al. <a href="https://arxiv.org/abs/1309.6392">arXiv</a></span></div><div style="text-align: justify;">[ Lipton ] <span style="font-family: "Lucida Grande", Helvetica, Arial, sans-serif; text-align: left;">The Mythos of Model Interpretability, Z. Lipton <a href="https://arxiv.org/abs/1606.03490">arXiv</a></span></div><div style="text-align: justify;">[ Molnar ] Interpretable ML book, C. Molnar <a href="https://christophm.github.io/interpretable-ml-book/">url</a></div><div style="text-align: justify;"><span style="background-color: white; caret-color: rgb(34, 34, 34); color: #222222; text-align: start;">[ Botton ] </span><span style="text-align: left;">From machine learning to machine reasoning </span><span style="text-align: left;">An essay, </span><span style="text-align: left;">Léon Bottou </span><a href="http://dx.doi.org/10.1007/s10994-013-5335-x">doi</a></div><div style="text-align: justify;">[ Du2020 ] <span style="font-family: Arial, Helvetica, sans-serif; text-align: left;">Techniques for Interpretable Machine Learning, Du et. 
al. <a href="https://cacm.acm.org/magazines/2020/1/241703-techniques-for-interpretable-machine-learning/fulltext">doi</a></div><div style="text-align: justify;">[ wang14 ] Falling Rule Lists, Wang and Rudin <a href="https://arxiv.org/abs/1411.5899">arXiv</a></div></span>msuzenhttp://www.blogger.com/profile/06434797231632063088noreply@blogger.com1tag:blogger.com,1999:blog-4550553973032503669.post-23973550345773910752020-11-30T12:35:00.008-08:002020-11-30T14:55:10.821-08:00 Re-discovery of Inverse problems: What is underspecification for machine learning models?<p></p><table cellpadding="0" cellspacing="0" class="tr-caption-container" style="float: left; margin-right: 1em; text-align: left;"><tbody><tr><td style="text-align: center;"><a href="https://upload.wikimedia.org/wikipedia/commons/9/9b/Johann_Radon.png" style="clear: right; margin-bottom: 1em; margin-left: auto; margin-right: auto;"><img border="0" data-original-height="800" data-original-width="466" height="200" src="https://upload.wikimedia.org/wikipedia/commons/9/9b/Johann_Radon.png" width="117" /></a></td></tr><tr><td class="tr-caption" style="text-align: center;">Radon, founder of<br />inverse problems (Wikipedia)</td></tr></tbody></table><p><span style="font-family: arial;">This is a concept that has been very well known in communities from geophysics to image reconstruction for many decades. Underspecification stems from Hadamard's definition of a <a href="https://en.wikipedia.org/wiki/Well-posed_problem">well-posed problem</a>; it isn't a new problem. If you do research on <i>underspecification for machine learning</i>, please make sure that the relevant literature on ill-posed problems is studied well before making strong statements. It would be helpful and would prevent reinvention of the wheel.</span></p><p><span style="font-family: arial;">One technique everyone is aware of is <a href="https://en.wikipedia.org/wiki/Tikhonov_regularization">L2 regularisation</a>, which is used to reduce the ill-posedness of machine learning models; a minimal numerical sketch is given below.</span></p>
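<p><span style="font-family: arial;">The following sketch (our own illustration; the Vandermonde forward operator and the noise level are assumptions) sets up an ill-conditioned linear inverse problem where naive least squares is typically unstable under noise, and shows how Tikhonov (L2) regularisation restores a stable solution.</span></p><pre>import numpy as np

rng = np.random.default_rng(1)
# An ill-conditioned forward operator: nearly collinear columns make the
# inverse problem ill-posed in Hadamard's sense (unstable w.r.t. noise).
A = np.vander(np.linspace(0.0, 1.0, 20), 8, increasing=True)
x_true = rng.normal(size=8)
b = A @ x_true + 1e-3 * rng.normal(size=20)  # noisy observations

# Naive least squares: small singular values of A amplify the noise.
x_ls = np.linalg.lstsq(A, b, rcond=None)[0]

# Tikhonov (L2) regularisation: minimise ||Ax - b||^2 + lam * ||x||^2,
# i.e. solve the regularised normal equations (A^T A + lam I) x = A^T b.
lam = 1e-4
x_l2 = np.linalg.solve(A.T @ A + lam * np.eye(8), A.T @ b)

print("condition number of A :", np.linalg.cond(A))
print("error, least squares  :", np.linalg.norm(x_ls - x_true))
print("error, Tikhonov (L2)  :", np.linalg.norm(x_l2 - x_true))</pre>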
<p><span style="font-family: arial;">In the context of how a deployed model's performance degrades over time, ill-posedness plays a role, but it is not the sole reason. There is a large literature on <a href="https://en.wikipedia.org/wiki/Inverse_problem">inverse problems</a> dedicated to solving these issues, and if underspecification were the sole issue for deployed machine learning systems degrading over time, we would have reduced the performance degradation by applying strong <a href="https://en.wikipedia.org/wiki/Lasso_(statistics)">L1-regularisations</a> to reduce "<i>the feature selection bias</i>", and hence lowered the effect of underspecification. Especially in deep learning models, underspecification should not be an issue, due to the <a href="https://en.wikipedia.org/wiki/Feature_learning">representation learning</a> that deep learning models bring naturally, given that the inputs cover the basic learning space.</span></p>msuzenhttp://www.blogger.com/profile/06434797231632063088noreply@blogger.com0