Inertia

Top K problem

Requirements

functional

  • topK(k, startTime, endTime)
  • scalable
  • available
  • highly performant: few ten milliseconds to return top 100 list
  • accurate

single host approach

image

  • build hashTable<Key, Count>
    • sort the whole list: O(nLog(n))
    • use heap: O(nLog(k))

but it’s not scalable.

Hash table, multiple hosts

image

you can scale previous approach by using data paritioner. scalability and througput has been addressed.

But streaming data is not bounded. It has infinite data. what if we need to calculate top K for a day or a week?

count-min sketch, multiple hosts

There is well-known data structure called count-min sketch, kind of approximation algorithm, which guarantees fixed memory usage. basically it tradedoff between accuracy and memory.

image

count-min sketch uses 2 dimensional array, each row is different hash function, column is count. whenever new event comes in, calculate hash value for each row, and increment the value of the cell by 1. This means that it could have collision, that’s why we take the smallest value as a result.

But this doesn’t give us accurate top K lists. if constraints don’t allow inaccuracy, then we need a different approach.

high level design

image

Reference