Tuesday, 4 October 2022

Heavy-matter-wave and ultra-sensitive interferometry: An opportunity for quantum gravity to become evidence-based research

    Solar Eclipse of 1919
(Wikipedia)

Preamble
 


   
Cool ideas in theoretical physics are often opaque to the general reader as to whether they are backed up by any experimental evidence in the real world. The success of LIGO (Laser Interferometer Gravitational-wave Observatory) has definitively proven the value of interferometry for advancing cool ideas of theoretical physics with real-world, measurable evidence. Another type of interferometry that could be used to test many different ideas from theoretical physics is called matter-wave interferometry, or atom interferometry. It has been around for decades, but new developments and the increased sensitivity of measurements on heavy atomic-system waves will provide the technical capability to test multiple ideas of theoretical physics.

Basic mathematical principle of interferometry

Usually interferometry is explained with device and experimental-setup details that can be confusing. However, one can explain the core principle without introducing any experimental setup. The basic idea of interferometry is that a simple wave, such as $\omega(t)=\sin\Theta(t)$, is first split into two waves that are reflected over the same distance in vacuum without any interactions, one of them shifted by a constant phase. A linear combination of the returned waves $\omega_{1}(t)=\sin \Theta(t)$ and $\omega_{2}(t)=\sin( \Theta(t) + \pi)$ sums to zero, i.e., an interference pattern generated by $\omega_{1}(t)+\omega_{2}(t)=0$. This very basic principle can be used to detect interactions, and the characteristics of those interactions, that the wave encounters over the time it travels out and back. Of course, the basic wave used in many interferometry experiments is laser light, and the interaction measured can be a gravitational wave that interacts with the laser light, i.e., LIGO's setup.
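The cancellation above is easy to check numerically. A minimal sketch in plain Python (no actual interferometer is modelled, just the two phase-shifted waves):

```python
import math

# Two copies of the same wave; the second is shifted by a constant phase pi.
def w1(theta):
    return math.sin(theta)

def w2(theta):
    return math.sin(theta + math.pi)

# Superpose the returned waves at a few sample times: they cancel
# (up to floating-point error), i.e., destructive interference.
superposition = [w1(t) + w2(t) for t in [0.0, 0.5, 1.0, 2.0]]
print(all(abs(s) < 1e-12 for s in superposition))  # True
```

Any interaction that perturbs the phase of one arm would break this exact cancellation, which is precisely what the interference pattern detects.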

Detection of matter waves: What do heavy and ultra-sensitive mean?

Each atomic system exhibits quantum wave properties, i.e., matter waves. This implies a given molecular system has wave signatures and characteristics that can be extracted in an experimental setting. Instead of laser light, one can use an atomic system that is reflected, following the same basic principle. However, the primary difference is that increasing mass requires orders-of-magnitude more sensitive wave detectors for atomic interferometers. Currently, heavy usually means above ~$10^{9}$ Da (compared to Helium-4, which is about ~4 Da). These new heavy atomic interferometers might be able to detect gravitational interactions at the quantum-wave level due to the ultra-sensitive precision achieved. This sounds trivial, but an experimental connection to theories of quantum gravity, one of the unsolved puzzles of theoretical physics, appears to be a potential breakthrough. Prominent examples in this direction are entropic gravity and wave-function collapse theories.
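As an illustrative aside: the de Broglie relation $\lambda = h/(mv)$ is standard physics, but the velocities below are assumed purely for the sketch. It shows why heavier systems demand more sensitive detectors: at the same speed, the matter wavelength shrinks in proportion to the mass.

```python
# Illustrative de Broglie wavelengths, lambda = h / (m * v).
# Velocities are assumed values, not from any specific experiment.
H = 6.62607015e-34      # Planck constant, J*s
DA = 1.66053907e-27     # one dalton (Da) in kg

def de_broglie(mass_da, velocity):
    """Wavelength in metres for a particle of given mass (Da) and speed (m/s)."""
    return H / (mass_da * DA * velocity)

lam_he4 = de_broglie(4, 100.0)      # Helium-4 at an assumed 100 m/s
lam_heavy = de_broglie(1e9, 100.0)  # a ~10^9 Da system at the same speed
print(lam_heavy < lam_he4)  # heavier mass -> much shorter matter wavelength
```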

Conclusion

Recent developments in heavy matter-wave interferometry could be leveraged to test quantum-gravity arguments and theoretical proposals. We have tried to bring this idea to general attention without resorting to experimental details. 

Further Reading & Notes
  • Dalton, the mass unit used in matter-wave interferometry. 
  • Atom Interferometry by Prof. Pritchard, YouTube.
  • Newton-Schrödinger equation.
  • A roadmap for universal high-mass matter-wave interferometry, Kilka et al., AVS Quantum Sci. 4, 020502 (2022). doi
    • Current capabilities as of 2022, atom interferometers can reach up to ~300 kDa.
  • Testing Entropic gravity, arXiv
  • NASA early stage ideas workshops : web-archive

Tuesday, 20 September 2022

Building robust AI systems: Is an artificial intelligent agent just a probabilistic boolean function?


Preamble
    George Boole (Wikipedia)

Agent, AI agent, or intelligent agent is often used to describe algorithms or AI systems recently released by research teams. However, the definition of an intelligent agent (IA) is a bit opaque. Naïvely, it is nothing more than a decision maker that shows some intelligent behaviour. However, making a decision intelligently is hard to quantify computationally, and probably an IA for us is something that can be represented as a Turing machine. Here, we argue that an intelligent agent in current AI systems should be seen as a function without side effects producing a boolean output, and should not be extrapolated or compared to human-level intelligence. Causal inference capabilities should be seen as scientific guidance for such function decompositions without side effects, i.e., human-in-the-loop Probabilistic Boolean Functions (PBFs).

Computational learning theories are based on binary learners

Two of the major theories of statistical learning, PAC and VC dimension, build upon "binary learning".  

PAC stands for Probably Approximately Correct. It sets out a basic framework and the mathematical building blocks for defining a machine learning problem from a complexity-theory standpoint. Probably approximately correct implies finding a weak learning function given the binary instance set $X=\{1,0\}^{n}$. The binary set or its subsets are mathematically called concepts, and under certain mathematical conditions a system is said to be PAC learnable. There are equivalences to VC dimension and other computational learning frameworks. 
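A toy sketch of these ingredients in plain Python, with illustrative names: the instance set, a concept as a binary subset (via its indicator function), and the error a hypothesis makes against it.

```python
from itertools import product

n = 3
X = list(product([0, 1], repeat=n))     # binary instance set {0,1}^n

# A concept is a subset of X, represented here by its indicator function.
concept = lambda x: x[0] == 1           # "first bit is set"
hypothesis = lambda x: sum(x) >= 2      # a candidate learner's guess

# Error of the hypothesis: fraction of instances where it disagrees
# with the concept (uniform distribution over X, for simplicity).
error = sum(concept(x) != hypothesis(x) for x in X) / len(X)
print(error)  # 0.25
```

PAC learnability then asks, roughly, whether such an error can be driven below any tolerance with high probability from polynomially many samples.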

Robust AI systems: Deep reinforcement learning and  PAC

Even though a theory of learning for deep (reinforcement) learning is not yet established and is an active area of research, there is an intimate connection with the composition of concepts, i.e., binary instance subsets, as almost all operations within deep RL can be viewed as probabilistic Boolean functions (PBFs). 
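As a minimal sketch of that viewpoint, consider a single sigmoid unit (an illustrative choice on our part, not a component lifted from any specific deep RL system): it maps a binary input vector to the probability of a boolean output, i.e., it is a probabilistic Boolean function.

```python
import math
import random

def pbf(x, weights, bias):
    """A sigmoid unit viewed as a probabilistic Boolean function:
    maps a binary vector x to P(output = 1)."""
    z = sum(w * xi for w, xi in zip(weights, x)) + bias
    return 1.0 / (1.0 + math.exp(-z))

def sample(x, weights, bias, rng):
    """Draw the boolean output according to the PBF's probability."""
    return 1 if rng.random() < pbf(x, weights, bias) else 0

rng = random.Random(42)
p = pbf((1, 0, 1), weights=(2.0, -1.0, 0.5), bias=-1.0)
print(p, sample((1, 0, 1), (2.0, -1.0, 0.5), -1.0, rng))
```

Composing many such units, as deep networks do, composes PBFs over binary concepts.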

Conclusion 

Current research and practice in robust AI systems could focus on producing learnable probabilistic Boolean functions (PBFs) as intelligent agents, rather than human-level intelligent agents. This modest purpose might bear more practical fruit than the long-term aim of replacing human intelligence. Moreover, the theory of computation for deep learning and causality could benefit from this approach. 

Further reading


Tuesday, 5 July 2022

Bayesian rabbit holes: Decoding conditional probability with non-commutative algebra

Preamble

    The White Rabbit
(Wikipedia)

A novice analyst, or even an experienced (data) scientist, might think that the bar notation $|$ representing conditional probability carries some distinct operational mathematics, primarily when written with explicit distribution functions, $p(x|y)$. A similar reading applies to joint probabilities such as $p(x, y)$, and one may see a mixture of these, such as $p(x, y | z)$. In this short exposition, we clarify that none of these identifications within the arguments of a probability has any distinct operational meaning. 

Arguments in probabilities: Boolean statements and filtering 

Arguments in any probability are mathematical statements of discrete mathematics that correspond to events in an experimental setting. They are statements declaring some fact with a boolean outcome; effectively, they are queries to a dataset, such as whether the temperature is above $30$ degrees, $T > 30$. Temperature $T$ is a random variable. Unfortunately, the term random variable is often used loosely in textbooks; it is defined as a mapping rather than as a single variable. The bar $|$ in the conditional probability $p(x|y)$ means statement $x$ given that statement $y$ has already occurred, i.e., an if. This interpretation implies that $y$ occurred before $x$, but it does not imply that they are causally linked. The condition plays the role of filtering, a where clause in query languages. $p(x|y)$ boils down to $p_{y}(x)$, where the first statement $y$ is applied to the dataset before computing the probability of the remaining statement $x$.
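The filtering reading $p(x|y) = p_{y}(x)$ can be made concrete on a small dataset. The records below are invented for illustration:

```python
# Toy dataset of daily records: (temperature, rained).
records = [(35, True), (32, False), (28, True), (31, True), (25, False)]

x = lambda r: r[1]          # statement x: "it rained"
y = lambda r: r[0] > 30     # statement y: "temperature above 30"

# p(x|y) as filtering: apply the where-clause y first, then compute
# the probability of x on the remaining rows, i.e., p_y(x).
subset = [r for r in records if y(r)]
p_x_given_y = sum(x(r) for r in subset) / len(subset)
print(p_x_given_y)
```

Here three rows survive the filter $y$ and two of them satisfy $x$, so the conditional probability is simply a ratio computed on the filtered subset.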

In the case of joint probabilities $p(x, y)$, the events co-occur, i.e., an AND statement. In summary, anything in the argument of $p$ is written as a mathematical statement. When assigning a distribution or a functional form to $p$, there is no particular role for conditionals or joints; the modelling approach sets an appropriate structure.

Conditioning does not imply causal direction: do-calculus does

A filtering interpretation of the conditional $p(x|y)$ does not imply a causal direction, but the $do$ operator does: $p(x|do(y))$. 

Non-commutative algebra: When the frequentist is equivalent to the Bayesian

Most simple filtering operations yield identical results when reversed, $p(x|y) = p(y|x)$, the prior being equal to the posterior. This remark implies we cannot apply Bayesian learning with commutative statements. We need non-commutative statements; as a result, one can do Bayesian learning with newly arriving data, i.e., the arrival of new subjective evidence. The reason appears to be the frequentist nature of filtering.

Outlook 

Even though we have provided some insight into decoding the operational meaning of conditional probabilities, we suggested that any conditional, joint, or combination of these within the argument of a probability has no operational purpose other than as pre-processing steps. However, the philosophical and practical implications of probabilistic reasoning are often counterintuitive, and probabilistic reasoning is computationally a hard problem. From a causal inference perspective, we are better equipped to tackle these issues with do-Bayesian analysis.  

Further reading

Please Cite as:

 @misc{suezen22brh, 
     title = {Bayesian rabbit holes: Decoding conditional probability with non-commutative algebra}, 
     howpublished = {\url{https://science-memo.blogspot.com/2022/07/bayesian-conditional-noncommutative.html}}, 
     author = {Mehmet Süzen},
     year = {2022}
}  

Monday, 20 June 2022

Empirical risk minimization is not learning:
A mathematical definition of learning and re-understanding of overfitting and Occam's razor in machine learning

    Simionescu Function (Wikipedia)

Preamble

The holy grail of machine learning appears to be empirical risk minimisation. However, contrary to the general dogma, the primary objective of machine learning is not risk minimisation per se but mimicking human or animal learning. Empirical risk minimisation is just a snapshot in this direction and part of a learning measure, not the primary objective.

Unfortunately, all current major machine learning libraries implement empirical risk minimisation as the primary objective, so-called training, usually manifested as .fit. Here we provide a mathematical definition of learning in the language of empirical risk minimisation, and its implications for two very important concepts: overfitting and Occam's razor.

Our exposition is still informal, but it should be readable for experienced practitioners.

Definition: Empirical Risk Minimization

Given a set of $k$ observations $\mathscr{O} = \{o_{1}, ..., o_{k} \}$ where $o_{i} \in \mathbb{R}^{n}$ are $n$-dimensional vectors, the corresponding labels, or binary classes, form the set $\mathscr{S} = \{ s_{1}, .., s_{k}\}$ with $s_{i} \in \{0,1\}$. A function $g$ maps observations to classes, $g: \mathscr{O} \to \mathscr{S}$. An error function (or loss) $E$ measures the error made by the estimated map $\hat{g}$ compared to the true map $g$, $E=E(\hat{g}, g)$. The entire idea of supervised machine learning boils down to minimising a functional called the ER (Empirical Risk), denoted here by $G$; it is a functional, meaning a function of a function, over the training domain $\mathscr{D} \subseteq \mathscr{O} \times \mathscr{S}$, in discrete form $$ G[E] = \frac{1}{k} \sum_{\mathscr{D}} E(\hat{g}, g). $$ This is the so-called training of a machine learning model, i.e., an estimation of $\hat{g}$. However, testing this estimate on new data is not the main purpose of learning.
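In code, the discrete functional above is just the average of the pointwise errors over the training pairs. A minimal sketch with a 0/1 loss; the data and the threshold map $\hat{g}$ are invented for illustration:

```python
# Observations (2-dimensional vectors) and their true binary classes.
observations = [(0.5, 1.2), (1.5, 0.3), (2.5, 2.8), (0.1, 0.2)]
true_classes = [0, 1, 0, 0]

def g_hat(o):
    """Estimated map: a simple threshold on the first coordinate."""
    return 1 if o[0] > 1.0 else 0

def zero_one_loss(predicted, actual):
    return 1.0 if predicted != actual else 0.0

# Empirical risk G[E]: mean error of g_hat over the training domain.
k = len(observations)
G = sum(zero_one_loss(g_hat(o), s)
        for o, s in zip(observations, true_classes)) / k
print(G)  # 0.25: one misclassified pair out of four
```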

Definition: Learning measure 

A learning measure $M$ on $\hat{g}$ is defined over a set of $l$ observation sets of increasing size, $\Theta = \{ \mathscr{O}_{1}, ..., \mathscr{O}_{l}\}$, whereby the size of each set is monotonically larger, meaning that $ | \mathscr{O}_{1}| < | \mathscr{O}_{2}| < ... < | \mathscr{O}_{l}|$.

Definition: Empirical Risk Minimization with a learning measure (ERL)

Now we are in a position to reformulate ER with a learning measure; we call this ERL. It comes with a testing procedure.

If the empirical risks $G[E_{j}]$, evaluated on the sets $\mathscr{O}_{j}$, decrease monotonically, $ G[E_{1}] > G[E_{2}] > ... > G[E_{l}]$, then we say the functional form of $\hat{g}$ is learning over the set $\Theta$.  
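The ERL test itself is a one-line monotonicity check over a learning curve. A sketch, where the risk values are illustrative numbers rather than outputs of a real model:

```python
def is_learning(risks):
    """ERL test: the functional form 'learns' over Theta if the
    empirical risks decrease monotonically with the set size."""
    return all(a > b for a, b in zip(risks, risks[1:]))

# Empirical risks G[E_j] on O_1 < O_2 < ... (illustrative values).
print(is_learning([0.40, 0.31, 0.22, 0.15]))  # True: monotone decrease
print(is_learning([0.40, 0.31, 0.35, 0.15]))  # False: risk bounced back
```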

Functional form of $\hat{g}$ : Inductive bias

The functional form implies a model selection; the technical term for this, together with other assumptions, is inductive bias, meaning the selection of the complexity of the model, for example linear regression versus nonlinear regression.

Re-understanding of overfitting and Occam's razor from ERL perspective 

Suppose we have two different ERLs, on $\hat{g}^{1}$ and $\hat{g}^{2}$. Then overfitting is a comparison problem between monotonically decreasing empirical risks. Given two models, here inductive biases or functional forms, over the learning measure, we select the one with "stronger monotonicity" and lower complexity, and call the other the overfitted model. Complexity here boils down to the functional complexity of $\hat{g}^{1}$ and $\hat{g}^{2}$, and overfitting can only be tested with two models via the monotonicity (decrease) of their ERLs.

Conclusions

In the age of deep learning systems, classical learning theory needs an update on how we define learning beyond a single-shot fitting exercise. A first step in this direction would be to improve the basic definition of Empirical Risk (ER) minimisation so that it reflects real-life learning systems, similar to the forgetting mechanism proposed by Ebbinghaus. This is consistent with Tom Mitchell's operational definition of machine learning. The next level would be to add causality to the definition.

Please cite as follows:

 @misc{suezen22erm, 
     title = { Empirical risk minimization is not learning : A mathematical definition of learning and re-understanding of overfitting and Occam's razor in machine learning}, 
     howpublished = {\url{http://science-memo.blogspot.com/2022/06/empirical-risk-minimisation-learning-curve.html}}, 
     author = {Mehmet Süzen},
     year = {2022}

}  

Postscript Notes

The following notes were added after the initial release.

Postscript 1: Understanding overfitting as comparison of inductive biases

ERM can be confusing even for experienced researchers. It is indeed about a risk measure: we measure the risk of a model, i.e., how much error a machine learning procedure would make on a given new data distribution, as in the risk of investing. This is quite similar to the notion of financial risk of loss, but that is rarely stated explicitly. 


Moreover, a primary objective of machine learning is not ERM but measuring learning curves and the pair-wise comparison of inductive biases, avoiding overfitting. An inductive bias, a concept we restrict here to the model type, is a model selection step: different parametrisations of the same model are still the same inductive bias. That's why standard training-error learning curves alone can't be used to detect overfitting. 

Postscript 2: Learning is not to optimise: Thermodynamic limit, true risk and accessible learning space

True risk minimisation in machine learning is not possible; instead, we rely on ERM, i.e., Empirical Risk Minimisation. However, the purpose of a machine learning algorithm is not to minimise risk, as we only have partial knowledge about reality through data. Learning implies finding a region in the accessible learning space in which there is a monotonic improvement in the objective; ERM is only a single point in this space. The concept is rooted in the German scientist Hermann Ebbinghaus's work on memory.


There is an intimate connection to the thermodynamic limit and true risk in this direction, which remains open research. However, the thermodynamic limit does not imply an infinite amount of data, but rather concerns the observable's behaviour. That's why fully empiricist approaches usually require the complement of physical laws, such as Physics-Informed Neural Networks (PINNs) or Structural Causal Models (SCMs).

Postscript 3: Missing abstraction in modern machine learning libraries 

Interestingly, current modern machine learning libraries stop abstracting beyond fitting: .fit and .predict. This falls short of learning as in machine learning. Learning manifests itself in learning curves. A .learn functionality could be leveraged beyond fitting, checking whether we are learning via monotonically increasing performance. The origin of this lack of tooling for .learn appears to be how Empirical Risk Minimisation (ERM) is formulated on a single task.
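A hypothetical .learn wrapper over the familiar fit-style interface could implement exactly this check. The class and method names below are an illustration, not any existing library's API, and the toy risk function is an assumed stand-in for a real .fit:

```python
class LearnableModel:
    """Sketch: wraps a fit-and-evaluate routine with a .learn that
    tracks empirical risk over training subsets of increasing size."""

    def __init__(self, fit_and_risk):
        # fit_and_risk(train_subset) -> empirical risk after fitting on it
        self.fit_and_risk = fit_and_risk
        self.curve = []

    def learn(self, data, sizes):
        """Fit on nested subsets of increasing size and report whether
        the risk decreases monotonically, i.e., whether we are learning."""
        self.curve = [self.fit_and_risk(data[:n]) for n in sizes]
        return all(a > b for a, b in zip(self.curve, self.curve[1:]))

# Toy risk that shrinks as 1/n with more data (assumed, for illustration).
model = LearnableModel(lambda subset: 1.0 / len(subset))
print(model.learn(list(range(100)), sizes=[10, 20, 40, 80]))  # True
```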
(c) Copyright 2008-2024 Mehmet Suzen (suzen at acm dot org)

Creative Commons License
This work is licensed under a Creative Commons Attribution 4.0 International License.