Figure 1: Empirical law of large numbers. The ratio of occurrences approaches a constant.

**Empirical law of large numbers**

If we repeat an experiment long enough, the outcomes approach their expected behaviour. The simplest example is a coin-toss experiment: a fair coin is equally likely to land heads or tails, 1 or 0. This implies that the ratio of heads to tails approaches one as the number of repeats increases. Say we toss the coin $N$ times, obtaining $n_{1}$ heads and $n_{0}$ tails. The empirical law of large numbers states

$ \frac{n_{1}}{n_{0}} \to 1 $ if $N$ is very large, $N \gg 1$.

But note that the absolute difference, $|n_{1}-n_{0}|$, does not approach any constant; on the contrary, it grows with the number of repeats. This is the classic gambler's fallacy: the mistaken belief that outcomes will balance out as the experiment is repeated.
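This behaviour can be checked numerically. Below is a minimal sketch, assuming only base R (`rbinom` plays the role of the coin toss), that tracks both the running ratio and the running absolute difference:

```r
set.seed(42)
N <- 1e5
coin <- rbinom(N, 1, 0.5)      # 1 = heads, 0 = tails
n1 <- cumsum(coin)             # running count of heads
n0 <- seq_len(N) - n1          # running count of tails
ratio <- n1 / n0               # approaches one
adiff <- abs(n1 - n0)          # does not approach zero
tail(ratio, 1)                 # very close to 1
tail(adiff, 1)                 # typically of order sqrt(N)
```

With $N = 10^5$ tosses the final ratio lands very close to one, while the final absolute difference is typically of the order of $\sqrt{N}$, not zero.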

**Fair coin-toss: Evolution of Bernoulli process**

Figure 2: No law of averages. The absolute difference of occurrences increases over repeats.

The Bernoulli process expresses binary outcomes, 1 or 0, i.e., success or failure, true or false. The Bernoulli distribution reads

$\mathrm{Ber}(k; p) = p^{k} (1-p)^{1-k}$ for $k \in \{0,1\}$,

where $p$ is the probability of success. We draw 20K samples from this distribution with $p=0.5$ to obtain a Bernoulli process, and repeat the experiment 50 times in order to obtain a "generalised" behaviour with uncertainty. This situation corresponds to a fair coin-toss experiment.
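The repeated experiment can be generated compactly in base R. This is an added sketch, not the appendix implementation; `rbinom` stands in for `LaplacesDemon::rbern`, and the parameter names `nr`/`nt` mirror those used in the appendix code:

```r
set.seed(987231)
nr <- 50      # repeats of the experiment
nt <- 20000   # Bernoulli trials per repeat
# each row holds the running |n1 - n0| for one repeat
diff_mat <- t(replicate(nr, {
  coin <- rbinom(nt, 1, 0.5)         # one fair-coin experiment
  n1 <- cumsum(coin)                 # running heads count
  abs(2 * n1 - seq_len(nt))          # |n1 - n0|, since n0 = t - n1
}))
mean_diff <- colMeans(diff_mat)      # average over repeats; grows with t
```

Averaging `diff_mat` over the 50 repeats gives the smoothed growing trend that Figure 2 shows via local regression.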

**Results**

Empirical results for the ratio of the two outcomes and their absolute difference over repeats are reported in Figures 1 and 2, respectively.

**Appendix: Source codes**

The R and Rcpp functions used to reproduce the plots in this post are shown in this section. Source files are also available on GitHub (here).

```r
#'
#' Misconception of the law of averages:
#' Understanding the empirical law of large numbers
#'
#' (c) 2016 Mehmet Suzen
#'
rm(list = ls())
set.seed(987231)
library(Rcpp)           # low-level implementations
library(LaplacesDemon)  # for drawing samples from the Bernoulli distribution
library(matlab)         # for tic/toc
# Set working directory to the script directory
Rcpp:::sourceCpp("empirical_law_of_large_numbers.cpp")  # diff_binary_vec2

#'
#' Calculate the running difference in a binary vector
#'
diff_binary_vec <- function(bvec) {
  ll <- length(bvec)
  diff_vec <- sapply(1:ll, function(i) { abs(i - 2.0 * sum(bvec[1:i])) })
  diff_vec
}

#'
#' Calculate the running ratio in a binary vector
#'
ratio_binary_vec <- function(bvec) {
  ll <- length(bvec)
  ratio_vec <- sapply(1:ll, function(i) { abs(i / sum(bvec[1:i]) - 1.0) })
  ratio_vec <- sapply(ratio_vec, function(rv) { if (is.infinite(rv)) { rv <- 0 }; rv })
  ratio_vec
}

#' Wall-clock timing difference
tb <- rbern(20000, 0.5)  # a fair coin
tic(); t1 <- diff_binary_vec(tb);   toc()
tic(); t2 <- diff_binary_vec2(tb);  toc()
tic(); r1 <- ratio_binary_vec(tb);  toc()
tic(); r2 <- ratio_binary_vec2(tb); toc()

#'
#' Generate the Bernoulli process
#'
nr <- 50     # repeats
nt <- 20000  # Bernoulli trials
tic()
bern_df <- data.frame(trial = c(), diff = c(), ratio = c())
for (i in 1:nr) {
  cat("repeat:", i, "\n")
  trial <- rbern(nt, 0.5)  # a fair coin
  diff  <- diff_binary_vec2(trial)
  ratio <- ratio_binary_vec2(trial)
  bern_df <- rbind(bern_df, cbind(1:nt, diff, ratio))
}
names(bern_df) <- c("trial", "diff", "ratio")

#' Now plot the ratio and diff evolution with local regression in ggplot2
library(ggplot2)
p_diff <- ggplot(data = bern_df, aes(x = trial, y = diff)) +
  geom_smooth(formula = "y~x") +
  theme(panel.background = element_blank(),
        axis.text.x  = element_text(face = "bold", color = "#000000", size = 11),
        axis.text.y  = element_text(face = "bold", color = "#000000", size = 11),
        axis.title.x = element_text(face = "bold", color = "#000000", size = 11),
        axis.title.y = element_text(face = "bold", color = "#000000", size = 11)) +
  xlab("Bernoulli Trials") +
  ylab("Difference between occurrences of the two outcomes") +
  ggtitle("No Law of Averages: Tails/Heads do not balance out!")
png(file = "no_law_of_averages.png")
p_diff
dev.off()
p_ratio <- ggplot(data = bern_df, aes(x = trial, y = ratio)) +
  geom_smooth(formula = "y~x") +
  theme(panel.background = element_blank(),
        axis.text.x  = element_text(face = "bold", color = "#000000", size = 11),
        axis.text.y  = element_text(face = "bold", color = "#000000", size = 11),
        axis.title.x = element_text(face = "bold", color = "#000000", size = 11),
        axis.title.y = element_text(face = "bold", color = "#000000", size = 11)) +
  xlab("Bernoulli Trials") +
  ylab("Ratio of occurrences of the two outcomes") +
  ggtitle("Empirical Law of Large Numbers: Ratio of Tails/Heads approaches one")
png(file = "law_of_large_numbers.png")
p_ratio
dev.off()
```

```cpp
#include <Rcpp.h>
#include <stdlib.h>
using namespace Rcpp;

//
// Sum a numeric vector up to an index
//
double sum_nv(NumericVector x, int s) {
  double ss = 0.0;
  for (int i = 0; i < (s + 1); i++) {
    ss = ss + x[i];
  }
  return(ss);
}

//
// Calculate the running difference in a Bernoulli process (binary_vec)
//
// [[Rcpp::export]]
NumericVector diff_binary_vec2(NumericVector binary_vec) {
  int ll = binary_vec.size();
  NumericVector diff_vec(ll, 0.0);
  for (int i = 0; i < ll; i++) {
    diff_vec[i] = std::abs(i + 1.0 - 2.0 * sum_nv(binary_vec, i));
  }
  return(diff_vec);
}

//
// Calculate the running ratio in a Bernoulli process (binary_vec)
//
// [[Rcpp::export]]
NumericVector ratio_binary_vec2(NumericVector binary_vec) {
  int ll = binary_vec.size();
  NumericVector ratio_vec(ll, 0.0);
  for (int i = 0; i < ll; i++) {
    ratio_vec[i] = std::abs(((i + 1.0) / sum_nv(binary_vec, i)) - 1.0);
  }
  return(ratio_vec);
}
```

Could you please elaborate a bit on why the second thought is wrong? If the number of heads follows a binomial process, what is the process that describes $|n_0-n_1|$?

Does it state that $n_0/n_1$ tends to $1$ but $|n_0-n_1|$ does not tend to zero?

The expectation that $d=|n_{0}-n_{1}|$ would approach zero is wrong. Why? It is a really good question. I think it is about human cognition; Amos Tversky and Daniel Kahneman call this a "cognitive bias" (https://en.wikipedia.org/wiki/Cognitive_bias). Mathematically speaking, the ratio $n_{0}/n_{1}$ approaches one in the limit of large $N$ because we set $p=0.5$ in the Bernoulli process. But there is no mathematical reason that $d$ must approach zero in that limit. In my interpretation, $d$ could be considered as a sum of "waiting times", the sum of the numbers of trials needed to reach each success in the Bernoulli process. I am not sure if this has a name as a process, maybe a "geometric process".
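One concrete way to answer the question of what process describes $d$ (an added sketch, not part of the original reply): $n_{1}-n_{0}$ is a simple symmetric random walk with $\pm 1$ steps, and the expected absolute position of such a walk after $N$ steps grows like $\sqrt{2N/\pi}$. A quick empirical check in base R:

```r
set.seed(1)
N    <- 10000
nrep <- 2000
# final |n1 - n0| over many independent runs of N tosses
d_final <- replicate(nrep, {
  steps <- 2 * rbinom(N, 1, 0.5) - 1   # +1 for heads, -1 for tails
  abs(sum(steps))                      # |n1 - n0| after N tosses
})
mean(d_final)       # empirical E|n1 - n0|, close to the theory below
sqrt(2 * N / pi)    # theoretical value, about 79.8 for N = 10000
```

So $d$ grows without bound, on average proportionally to $\sqrt{N}$.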

There is an intuitive way of showing how this works via an example. Try a series of fractions such as 9/10, 90/100, 900/1000, 9000/10000, ... You will see that they all represent the rational number 0.9. But now consider the differences between the numerator and denominator of each fraction: 1, 10, 100, 1000, .... In other words, the difference keeps on increasing. More generally, any sequence of fractions that tends to a fixed ratio in the limit will invariably consist of ever larger numerators and denominators, and for equivalent ratios (same order of magnitude) the differences will invariably increase.
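The fraction example above can be verified in a couple of lines of R:

```r
num <- 9 * 10^(0:4)   # numerators:   9, 90, 900, 9000, 90000
den <- 10^(1:5)       # denominators: 10, 100, 1000, 10000, 100000
num / den             # every ratio is 0.9
den - num             # 1, 10, 100, 1000, 10000: the difference keeps growing
```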

HTH

Great post but I feel the code is a bit overkill.

Shouldn't just

coin <- rbinom(1e4, 1, 0.5)

coin1 <- cumsum(coin) # cumsum of 1

coin0 <- (1:length(coin)) - coin1 # cumsum of 0

plot(coin1/coin0)

plot(abs(coin1-coin0))

do the job?

No extra packages, no C++ code and just 5 lines of code

Thanks for the compact code reproducing the same thing. I agree that my code is a bit of an overkill just to produce the result. However, my purpose was also to show how to accelerate a simulation code using Rcpp. My version runs at least an order of magnitude faster compared to yours, and of course my plots are more presentable.

I agree about the graphs. However you are wrong about the timings. Try

tb <- rbern(20000, 0.5) # a fair-coin

tic()

coin1 <- cumsum(tb) # sum of 1

coin0 <- (1:length(tb)) - coin1 # sum of 0

t1 <- abs(coin1-coin0)

toc() # 0.02 sec in my machine

tic()

t2 <- diff_binary_vec2(tb)

toc() # 0.33 sec on my machine

identical(t1, as.integer(t2)) # Rcpp code produces numeric vector

with your code and you will notice my code is at least an order of magnitude faster than the Rcpp code. In fact, if you replace (1:length(tb)) with seq_along(tb), my code is even faster. Your Rcpp code can be further optimised, but it cannot be made faster than the R code in this case.

Thank you Chris for the optimized R code, well done.

I only got a speed-up against my own R functions using the Rcpp versions: "diff_binary_vec" vs. "diff_binary_vec2" and "ratio_binary_vec" vs. "ratio_binary_vec2".

No worries. cumsum is a primitive function in R and is implemented in C. This is why it is hard to optimise it any further.
