One link per day for the next 31

N choose K: The best algorithm no one knows about


15/01/2018     permalink     linked site


How to pick randomly N items into K, with uniform distribution ?

The straightforward way is to mix the K items randomly, then take the N firsts. But that's long, awful, and infeasible on big datasets (because, you know, you need to get the full dataset in memory).

This article explains and show an implementation of the Vitter algorithm, published in 1987, achieving a linear time complexity and quasi-constant memory complexity. Yes, you just read that.

I have to say: i had the intuition of that principle since a while, but it is by reading this article i really gain confidence into the intuition. It's intuitive that it works, but to me the proof is hard to achieve.

I implemented a Python version, and benchmarked it against other technics, founding that my implementation is quicker and more flexible than the one in stdlib as soon as the number of elements is dropping.