A common misconception in statistics, a counter-intuitive fallacy, appears in the assumption that a law of averages exists. Imagine we toss a fair coin many times: most people would expect the numbers of heads and tails to balance out as the number of trials increases, which is wrong. If you did not expect that, you might have very good statistical intuition. Briefly, we will illustrate this kind of gambler's fallacy with a simple simulation approach and discuss the empirical law of large numbers.
Empirical law of large numbers
If we repeat an experiment long enough, we approach the expected outcome. The simplest example is a coin-toss experiment: a fair coin is equally likely to come up head or tail, 1 or 0. This implies that the ratio of heads to tails approaches one as the number of repeats increases. Let's say we toss the coin $N$ times, and the numbers of heads and tails are $n_{1}$ and $n_{0}$. The empirical law of large numbers states
$\frac{n_{1}}{n_{0}} \to 1$ if $N$ is very large, $N \gg 1$.
But note that the absolute difference, $|n_{1}-n_{0}|$, does not approach any constant; on the contrary, it tends to grow as the number of repeats increases. This is a classic example of the gambler's fallacy: the belief that an outcome will balance out as there are more repeats.
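As a quick numerical check, here is a minimal sketch in base R (the full simulation with Rcpp is given in the appendix) that tracks both quantities as the number of tosses grows:

# Minimal sketch: running ratio vs. running absolute difference for a fair coin
set.seed(42)
N      <- 20000
tosses <- rbinom(N, 1, 0.5)   # 1 = head, 0 = tail
n1     <- cumsum(tosses)      # running number of heads
n0     <- seq_len(N) - n1     # running number of tails
tail(n1 / n0, 3)              # ratio settles close to 1
tail(abs(n1 - n0), 3)         # absolute difference stays far from 0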
Fair coin-toss: Evolution of Bernoulli process
The Bernoulli process expresses binary outcomes, 1 or 0, i.e., success or failure, true or false. The Bernoulli distribution reads
$Ber(p, k) = p^{k} (1-p)^{1-k}$ for $k \in \{0,1\}$.
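As a concrete reading of the formula, a short sketch (the helper name dbern_manual is only illustrative; the appendix uses LaplacesDemon::rbern for sampling) evaluates the two probability masses and draws a short fair coin-toss process:

# Bernoulli pmf: Ber(p, k) = p^k * (1 - p)^(1 - k), k in {0, 1}
dbern_manual <- function(k, p) p^k * (1 - p)^(1 - k)
dbern_manual(1, 0.5)               # probability of a head (k = 1) is 0.5
dbern_manual(0, 0.5)               # probability of a tail (k = 0) is 0.5
rbinom(10, size = 1, prob = 0.5)   # ten tosses of a fair coin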
Figure 2: No law of averages. The absolute difference of occurrences increases over repeats.
Results
Empirical results for the ratio of the two outcomes and for their absolute difference over repeats are reported in Figures 1 and 2, respectively.
Appendix: Source code
R and Rcpp functions are shown in this section to reproduce the plots in this post. Source files are also available on github (here).
#'
#' Misconception of the law of averages:
#' Understanding the empirical law of large numbers
#'
#' (c) 2016 Mehmet Suzen
#'
rm(list=ls())
set.seed(987231)
library(Rcpp)           # low-level implementations
library(LaplacesDemon)  # for drawing samples from the Bernoulli distribution
library(matlab)         # for tic/toc
# Set working directory to the script directory
Rcpp::sourceCpp("empirical_law_of_large_numbers.cpp") # diff_binary_vec2, ratio_binary_vec2

#'
#' Calculate the running difference in a binary vector
#'
diff_binary_vec <- function(bvec) {
  ll       <- length(bvec)
  diff_vec <- vector("double", ll)
  diff_vec <- sapply(1:ll, function(i) { abs(i - 2.0 * sum(bvec[1:i])) })
  diff_vec
}

#'
#' Calculate the running ratio in a binary vector
#'
ratio_binary_vec <- function(bvec) {
  ll        <- length(bvec)
  ratio_vec <- vector("double", ll)
  ratio_vec <- sapply(1:ll, function(i) { abs(i / sum(bvec[1:i]) - 1.0) })
  ratio_vec <- sapply(ratio_vec, function(rv) { if (is.infinite(rv)) { rv <- 0 }; rv })
  ratio_vec
}

#' Wall-clock timing comparison
tb <- rbern(20000, 0.5) # a fair coin
tic()
t1 <- diff_binary_vec(tb)
toc()
tic()
t2 <- diff_binary_vec2(tb)
toc()
tic()
r1 <- ratio_binary_vec(tb)
toc()
tic()
r2 <- ratio_binary_vec2(tb)
toc()

#'
#' Generate the Bernoulli process
#'
nr <- 50    # repeats
nt <- 20000 # Bernoulli trials
tic()
bern_df <- data.frame(trail=c(), diff=c(), ratio=c())
for (i in 1:nr) {
  cat("repeat:", i, "\n")
  trail   <- rbern(nt, 0.5) # a fair coin
  diff    <- diff_binary_vec2(trail)
  ratio   <- ratio_binary_vec2(trail)
  bern_df <- rbind(bern_df, cbind(1:nt, diff, ratio))
}
toc()

#' Now plot the ratio and difference evolution with local regression in ggplot2
names(bern_df) <- c("trail", "diff", "ratio")
library(ggplot2)
p_diff <- ggplot(data=bern_df, aes(x=trail, y=diff)) +
  geom_smooth(formula="y~x") +
  theme(panel.background = element_blank(),
        axis.text.x  = element_text(face="bold", color="#000000", size=11),
        axis.text.y  = element_text(face="bold", color="#000000", size=11),
        axis.title.x = element_text(face="bold", color="#000000", size=11),
        axis.title.y = element_text(face="bold", color="#000000", size=11)) +
  xlab("Bernoulli Trials") +
  ylab("Difference between occurrences of the two outcomes") +
  ggtitle("No Law of Averages: Tails/Heads do not balance out!")
png(file="no_law_of_averages.png")
p_diff
dev.off()

p_ratio <- ggplot(data=bern_df, aes(x=trail, y=ratio)) +
  geom_smooth(formula="y~x") +
  theme(panel.background = element_blank(),
        axis.text.x  = element_text(face="bold", color="#000000", size=11),
        axis.text.y  = element_text(face="bold", color="#000000", size=11),
        axis.title.x = element_text(face="bold", color="#000000", size=11),
        axis.title.y = element_text(face="bold", color="#000000", size=11)) +
  xlab("Bernoulli Trials") +
  ylab("Ratio of occurrences of the two outcomes") +
  ggtitle("Empirical Law of Large Numbers: Ratio of Tails/Heads approaches one")
png(file="law_of_large_numbers.png")
p_ratio
dev.off()
#include <Rcpp.h>
#include <stdlib.h>
using namespace Rcpp;

//
// Sum a numeric vector up to an index
//
double sum_nv(NumericVector x, int s) {
  int i = 0;
  double ss = 0.0;
  for (i = 0; i < (s + 1); i++) {
    ss = ss + x[i];
  }
  return(ss);
}

//
// Calculate the running difference in a Bernoulli process (binary_vec)
//
// [[Rcpp::export]]
NumericVector diff_binary_vec2(NumericVector binary_vec) {
  int ll = binary_vec.size();
  int i  = 0;
  NumericVector diff_vec(ll, 0.0);
  for (i = 0; i < ll; i++) {
    diff_vec[i] = std::abs(i + 1.0 - 2.0 * sum_nv(binary_vec, i));
  }
  return(diff_vec);
}

//
// Calculate the running ratio in a Bernoulli process (binary_vec)
//
// [[Rcpp::export]]
NumericVector ratio_binary_vec2(NumericVector binary_vec) {
  int ll = binary_vec.size();
  int i  = 0;
  NumericVector ratio_vec(ll, 0.0);
  for (i = 0; i < ll; i++) {
    ratio_vec[i] = std::abs(((i + 1.0) / sum_nv(binary_vec, i)) - 1.0);
  }
  return(ratio_vec);
}
Could you please elaborate a bit on why the second thought is wrong? If the number of heads follows a binomial process, what is the process that describes $|n_0-n_1|$?
Does it state that $n_0/n_1$ tends to $1$ but $|n_0-n_1|$ does not tend to zero?
The expectation that $d=|n_{0}-n_{1}|$ would approach zero is wrong. Why? It is really a good question; I think it is about human cognition. Amos Tversky and Daniel Kahneman call this a "cognitive bias" (https://en.wikipedia.org/wiki/Cognitive_bias). Mathematically speaking, the ratio $n_{0}/n_{1}$ approaches one in the limit of large $N$, because we set $p=0.5$ in the Bernoulli process. But there is no mathematical reason that $d$ must approach zero in that limit. In my interpretation, $d$ could be considered as a sum of "waiting times", the numbers of trials needed to reach a success in the Bernoulli process. I am not sure whether this has a name as a process, maybe a "geometric process".
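A small base R sketch (the helper name mean_abs_diff is only illustrative) shows this numerically: the average of $d$ over many repeats keeps growing with $N$ instead of approaching zero.

# Estimate the mean of |n1 - n0| for increasing N via simulation
set.seed(123)
mean_abs_diff <- function(N, repeats = 200) {
  mean(replicate(repeats, {
    n1 <- sum(rbinom(N, 1, 0.5))
    abs(N - 2 * n1)   # |n1 - n0|, since n0 = N - n1
  }))
}
sapply(c(100, 1000, 10000, 100000), mean_abs_diff)   # grows with N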
There is an intuitive way of showing how this works via an example. Try a series of fractions such as 9/10, 90/100, 900/1000, 9000/10000, ... You will see that they all represent the rational number 0.9. But now consider the differences between the numerator and denominator of each fraction: 1, 10, 100, 1000, .... In other words, the difference keeps increasing. More generally, any sequence of fractions that tends to 0.9 in the limit will invariably consist of ever larger numbers for both numerator and denominator. For equivalent ratios (of the same order of magnitude) the differences will invariably increase.
HTHs
Great post but I feel the code is a bit overkill.
Shouldn't just
coin <- rbinom(1e4, 1, 0.5)
coin1 <- cumsum(coin) # cumsum of 1
coin0 <- (1:length(coin)) - coin1 # cumsum of 0
plot(coin1/coin0)
plot(abs(coin1-coin0))
do the job?
No extra packages, no C++ code and just 5 lines of code
Thanks for the compact code reproducing the same thing. I agree that my code is a bit of an overkill just to produce the result. However, my purpose was also to show how to accelerate a simulation code using Rcpp. My version runs at least an order of magnitude faster compared to yours, and of course my plots are more presentable.
I agree about the graphs. However, you are wrong about the timings. Try
tb <- rbern(20000, 0.5) # a fair-coin
tic()
coin1 <- cumsum(tb) # sum of 1
coin0 <- (1:length(tb)) - coin1 # sum of 0
t1 <- abs(coin1-coin0)
toc() # 0.02 sec on my machine
tic()
t2 <- diff_binary_vec2(tb)
toc() # 0.33 sec on my machine
identical(t1, as.integer(t2)) # Rcpp code produces numeric vector
with your code and you will notice my code is at least an order of magnitude faster than the Rcpp code. In fact, if you replace (1:length(tb)) with seq_along(tb), my code is even faster. Your Rcpp code can be optimised further, but it cannot be made faster than the R code in this case.
Thank you Chris for the optimized R code, well done.
I only got a speed-up against my own R functions using the Rcpp versions: "diff_binary_vec" vs. "diff_binary_vec2" and "ratio_binary_vec" vs. "ratio_binary_vec2".
No worries. cumsum is a primitive function in R and is implemented in C. This is why it is hard to optimise it any further.
Interesting article. Good point.