Requirements
functional
- topK(k, startTime, endTime)
- scalable
- available
- highly performant: few ten milliseconds to return top 100 list
- accurate
single host approach
- build hashTable<Key, Count>
- sort the whole list: O(nLog(n))
- use heap: O(nLog(k))
but it’s not scalable.
Hash table, multiple hosts
you can scale previous approach by using data paritioner. scalability and througput has been addressed.
But streaming data is not bounded. It has infinite data. what if we need to calculate top K for a day or a week?
count-min sketch, multiple hosts
There is well-known data structure called count-min sketch, kind of approximation algorithm, which guarantees fixed memory usage. basically it tradedoff between accuracy and memory.
count-min sketch uses 2 dimensional array, each row is different hash function, column is count. whenever new event comes in, calculate hash value for each row, and increment the value of the cell by 1. This means that it could have collision, that’s why we take the smallest value as a result.
But this doesn’t give us accurate top K lists. if constraints don’t allow inaccuracy, then we need a different approach.