Article

The Hashing Trick

If you are doing any sort of classification with thousands of features (as is common in text classification) and need to update your Bag of Words models online, then you’ll find this rather simple technique (…) Read more

  • Comments Off
  • Filed under: code
Article

Q-Digest

Continuing on the same note as the Greenwald-Khanna post, we discuss another novel data structure for order statistics called Q-Digest. Compared to GK, Q-Digest is much simpler, easier to implement and extends to a distributed (…) Read more

Article

Stream Algorithms: Order Statistics

Algorithms on streams have been all the rage in the early 2000’s. Right now I’m interested in obtaining order statistics, such as quantile information, in the case where there is just too much data to fit in memory. These algorithms have evolved a lot from the database world, where query optimizers have been maintaining histograms and other statistics for years.

In this particular blog post, we will examine the case where we have millions of data and we want to know things like the median, the 90th percentile, the rank of a specific element, etc. These are all really simple to calculate when all the data is in memory. In our case though we can’t do that, but we are willing to sacrifice some accuracy. Read more

  • Comments Off
  • Filed under: code