Scientific Memo

Showing posts with label Scientific Computing. Show all posts

Thursday, 1 April 2021

Shifting Modern Data Science Forward: Dijkstra principle for data science

Kindly reposted to KDnuggets by Gregory Piatetsky-Shapiro with the title Data science is not about data -applying Dijkstra principle to data science and enhancements.

Prelude

Dijkstra in Zurich, 1984 (Wikipedia)

Edsger Dijkstra was a Dutch theoretical physicist turned computer scientist, and probably one of the most influential earlier pioneers in the field. He had deep insight in what is computer science and well founded notion of how should it be taught in academics. In this post we extrapolate his ideas into data science. We developed something called, Dijkstra principle for data science, that is driven by his ideas on what does computer science entails.

Computer Science and Astronomy

Astronomy is not about telescopes. Indeed, it is about how universe works and how its constituent parts are interacting. Telescopes, either being optical or radio observations or similar detection techniques are merely tools to practice and do investigation for astronomy. A formed analogy goes into computer science as well, this is the quote from Dijkstra:

Computer science is no more about computers than astronomy is about telescopes. - Edsger Dijkstra

The idea of Computer Science being not about computer is rather strange in the first instance. However, what Dijkstra had in mind is abstract mechanism and mathematical constructs that one can map real problems and solve it as a computer science problem, such as graph algorithms. Though Computer Science had a lot of subfields but its inception can be considered as rooted in applied mathematics.

Dijkstra principle for data science

By using Dijkstra's approach now we are in position to formulate a principle for data science.

Data science is no more about data than computer science is about computers. -Dijkstra principle for data science

This sounds absurd. If data science is not about data, then what is it about? Apart from definition of data science as an emergent field, as an amalgamation of multiple fields from statistics to high performance computing, the idea that data not being the core tenant of data science implies the practice does not aim at data itself rather a higher purpose. Data is used similar to a telescope in astronomy, the purpose is to reveal the empirical truths about representations data conveys. There is no unique ways to achieve this purpose.

Conclusive Remarks

Dijkstra principle for data science would be very helpful in understanding the data science practice as not data-centric, contrary to mainstream dogma, rather as a science-centric practice with the data being the primary tool to leverage, using multitude of techniques. Implication is that machine learning is a secondary tool on top of data in practicing data science. This attitude would help causality playing a major role shifting modern data science forward.

Monday, 1 August 2016

Understanding the empirical law of large numbers and the gambler's fallacy

Platonic dices (Wikipedia).

One of the misconceptions in our understanding of

statistics, or a counter-intuitive guess, fallacy, appears in the assumption of the existence of the law of averages. Imagine we toss a fair coin many times, most people would think that the number of heads and tails would be balanced over the increasing number of trails, which is wrong. If you don't, then you might have a very good statistical intuition. Briefly, we will illustrate this, a kind of gambler's fallacy with a simple simulation approach and discuss the empirical law of large numbers.

Empirical law of large numbers

If we repeat an experiment long enough, we would approach to expected outcome. The simplest example is a coin-toss experiment, that an expected
fair coin toss would lead to equal likelihood for head and tail, 1 or 0. This implies that, the ratio of head and tails will approach to one with increasing number of repeats. Let's say, we toss the coin $N$ times. The number of heads and tails would be $n_{1}$ and $n_{0}$. The emprical law of large numbers states

$ \frac{n_{1}}{n_{0}} \to 1.0 $ if $N$ is very large, $N >>1$

But note that, the absolute difference, $|n_{1}-n_{0}|$ does not approach to any constant, on the contrary, it will increase with increasing number of repeats. This is classic example of gambler's fallacy that an outcome would balance out as there are more repeats.

Fair coin-toss: Evolution of Bernoulli process

The Bernoulli process expresses binary outcomes, 1 or 0, i.e., success or failure, true or false. Bernoulli distribution reads,

$Ber(p, k) = p^{k} (1-p)^{1-k}$ for $k \in \{0,1\}$.

Figure 2: No law of averages. Absolute difference
of occurances
increases over repeats.

Figure 1: Empirical law of large numbers, ratio of
occurances approach to a constant.

$p$ is the probability of success. We draw 50K samples from this distribution to get a Bernoulli process with $p=0.5$ and repeat the experiment 50 times, in order to obtain a "generalised" behaviour with uncertainty. This situation corresponds to a fair coin-toss experiment.

Results

Empirical results of ratio of two outcomes and their absolute difference over repeats are reported in Figure 1 and 2 respectively..

Appendix: Source codes

R and Rcpp functions are shown in this section to reproduce the plots in this post. Source files are also available on github (here).

#'
#'  Misconception of the law of averages: Understanding the empirical law of large numbers
#' 
#'  (c) 2016 Mehmet Suzen 
#'  

rm(list=ls())
set.seed(987231)
library(Rcpp)          # law level implementations
library(LaplacesDemon) # for drawing samples from Bernoulli distribution
library(matlab)        # for tic/toc
# Set working directory to script directory 
Rcpp:::sourceCpp("empirical_law_of_large_numbers.cpp") # diff_binary_vec2
#'
#' Calculate the running difference in a binary vector 
#' 
diff_binary_vec  <- function(bvec) {
 ll <- length(bvec) 
 diff_vec  <- vector("double", ll)
 diff_vec  <- sapply(1:ll, function(i) { abs(i-2.0*sum(bvec[1:i])) } )
 diff_vec
}
#'
#' Calculate the running ratio in a binary vector 
#'
ratio_binary_vec  <- function(bvec) {
 ll         <- length(bvec) 
 ratio_vec  <- vector("double", ll)
 ratio_vec  <- sapply(1:ll, function(i) { abs(i/sum(bvec[1:i])-1.0) } )
 ratio_vec  <- sapply(ratio_vec, function(rv) {if(is.infinite(rv)) { rv <- 0; }; rv })
 ratio_vec
}

#' Wall-call timing difference 
tb <- rbern(20000, 0.5) # a fair-coin
tic()
t1 <- diff_binary_vec(tb)
toc()
tic()
t2 <- diff_binary_vec2(tb)
toc()
tic()
r1 <- ratio_binary_vec(tb)
toc()
tic()
r2 <- ratio_binary_vec2(tb)
toc()
#' 
#' Generate Bernoulli Process
#' 
nr        <- 50    # repeats
nt        <- 20000 # Bernoulli trails
tic()
bern_df <- data.frame(trail=c(), diff=c(), ratio=c())
for(i in 1:nr) {
  cat("repeat:",i, "\n")
  trail   <- rbern(nt, 0.5) # a fair-coin
  diff    <- diff_binary_vec2(trail)
  ratio   <- ratio_binary_vec2(trail)
  bern_df <- rbind(bern_df, cbind(1:nt, diff, ratio))
}

#' Now plot ratio and diff evolution with local regression in ggplot2
names(bern_df) <- c("trail", "diff", "ratio")
library(ggplot2)
p_diff <- ggplot(data=bern_df, aes(x=trail, y=diff)) + geom_smooth(formula="y~x") + 
          theme(
                panel.background = element_blank(), 
                axis.text.x      = element_text(face="bold", color="#000000", size=11),
                axis.text.y      = element_text(face="bold", color="#000000", size=11),
                axis.title.x     = element_text(face="bold", color="#000000", size=11),
                axis.title.y     = element_text(face="bold", color="#000000", size=11))  +
           xlab("Bernoulli Trails") + ylab("Difference between occurance of two outcomes") +
           ggtitle("No Law of Averages: Tail/Heads do not balanced out!")

png(file="no_law_of_averages.png")
p_diff
dev.off()

p_ratio <- ggplot(data=bern_df, aes(x=trail, y=ratio)) + geom_smooth(formula="y~x") + 
          theme(
                panel.background = element_blank(), 
                axis.text.x      = element_text(face="bold", color="#000000", size=11),
                axis.text.y      = element_text(face="bold", color="#000000", size=11),
                axis.title.x     = element_text(face="bold", color="#000000", size=11),
                axis.title.y     = element_text(face="bold", color="#000000", size=11))  +
           xlab("Bernoulli Trails") + ylab("Difference between occurance of two outcomes") +
           ggtitle("Empirical Law of Large Numbers: Ratio of Tails/Heads approach to one")

png(file="law_of_large_numbers.png")
p_ratio
dev.off()

#include <Rcpp.h>
#include <stdlib.h>
using namespace Rcpp;

//
// Sum a numeric vector up to an index
//
double sum_nv(NumericVector x, int s) {
  int i=0;
  double ss=0.0;
  for(i=0;i<(s+1);i++) {
   ss=ss+x[i];
  }
  return(ss);
}

//
// Calculate the running difference in a Bernoulli process (binary_vec)
//
// [[Rcpp::export]]
//
NumericVector diff_binary_vec2(NumericVector binary_vec) {
  int ll = binary_vec.size();
  int i  = 0;
  NumericVector diff_vec(ll, 0.0);
  for(i=0;i<ll;i++) {
    diff_vec[i] = std::abs(i+1.0-2.0*sum_nv(binary_vec, i));
  }
  return(diff_vec);
}

//
// Calculate the running ratio in a Bernoulli process (binary_vec)
//
// [[Rcpp::export]]
NumericVector ratio_binary_vec2(NumericVector binary_vec) {
  int ll = binary_vec.size();
  int i  = 0;
  NumericVector ratio_vec(ll, 0.0);
  for(i=0;i<ll;i++) {
    ratio_vec[i] = std::abs(((i+1.0)/sum_nv(binary_vec, i))-1.0);
  }
  return(ratio_vec);
}

Monday, 12 December 2011

Analysis of MPI codes.

Massage passing interface library, since its inception is on the rise constantly in parallel programming community. This recent article discusses the analysis of MPI based codes.

Sunday, 11 December 2011

High Performance Fortran

High performance fortran
Fortran is truly a language of choice in most of the scientific applications and
one of the pioneering in high performance computing. This recent article discusses current status.