With the rise of complex network science and allied data-driven approaches over the last decade or so, using statistical mechanical techniques outside materials or atomic physics has become a standard and quite popular practice. One of the most interesting applications has been football (soccer, for Americans), probably because of the mass football mania in the UK, the rest of Europe and, of course, the New World. Using statistics in sport is not new at all; what is new is finding similarities with atomic systems. The quantitative approach to sports in particular is very well known, as shown by Moneyball, a recent film featuring Brad Pitt. Given the sky-high salaries of players, every major sports club (or franchise, in the US) now runs a statistics division. There are interesting works on goal statistics, which turn out to be non-Gaussian [link] [link] [link], and on passing networks for football strategies [link], among others. This kind of research is classified as econophysics.
Scientific Scratch Pad of Memo:
Physics, Mathematics, Computer Science, Statistics, Chemistry
by Mehmet Süzen
See also: Memo's Island Blog
Thursday, 29 November 2012
Monday, 5 November 2012
Imputation of missing data: Recursive 1D discrete KNN algorithm
Generated data often have missing components or values. Probably the most common occurrence is in time series data, where no value is available at a given time point and a NaN (or NA in R) is generally placed instead. There is a large literature on how statistical analysis should be performed on such data sets. For example, the seminal book by Little & Rubin, Statistical Analysis with Missing Data, provides a very detailed exposition of the field. Probably the simplest of all methods is imputation. There are high-quality R packages, such as imputation or mi, that do the job for you. Similarly, knnimpute from the MATLAB Bioinformatics Toolbox provides a comparable solution.
The k-nearest neighbour algorithm (KNN) is the most common approach to discovering the closest available value in a data vector. However, KNN implementations often contain many options that are not needed for simple imputation in 1D with the Euclidean metric. Here I propose a recursive 1D discrete KNN algorithm that scans a given vector and determines the closest available value to a given index. The main idea is to assume periodic boundary conditions at the vector index boundaries (a two-way linear search on the ring, see the simple sketch). This is an $O(N)$ algorithm: in the worst case we scan the vector twice. I have implemented this idea in MATLAB for a given matrix. The files are available on the MathWorks File Exchange [link]. This tool could be useful for imputing data in regression design matrices.
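To make the two-way search on the ring concrete, here is a minimal sketch in Python. It is not the MATLAB code from the File Exchange: the function names `nearest_available` and `impute_ring` are made up for illustration, the search is written as a loop rather than recursively, and preferring the left neighbour when both sides are equally close is my own assumption.

```python
import math


def nearest_available(values, index):
    """Return the closest non-NaN value to `index`, treating the list as a ring."""
    n = len(values)
    if n == 0:
        raise ValueError("empty vector")
    # Step outward one position at a time; offset 0 checks the index itself.
    # Offsets up to n // 2 in both directions cover every position on the ring.
    for offset in range(n // 2 + 1):
        left = values[(index - offset) % n]   # wraps around past the start
        right = values[(index + offset) % n]  # wraps around past the end
        if not math.isnan(left):
            return left  # tie-break: prefer the left neighbour (an assumption)
        if not math.isnan(right):
            return right
    raise ValueError("no available value in the vector")


def impute_ring(values):
    """Return a copy of `values` with each NaN replaced by its nearest value."""
    # The search always reads the original list, never already-imputed entries.
    return [nearest_available(values, i) if math.isnan(v) else v
            for i, v in enumerate(values)]


nan = float("nan")
print(impute_ring([nan, 1.0, nan, nan, 4.0]))  # [4.0, 1.0, 1.0, 4.0, 4.0]
```

Note how the first entry picks up 4.0 from the far end of the vector: with periodic boundaries the last element is one step to its left.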