Top K problem

2021-06-14

Requirements

functional

topK(k, startTime, endTime)
scalable
available
highly performant: few ten milliseconds to return top 100 list
accurate

single host approach

build hashTable<Key, Count>
- sort the whole list: O(nLog(n))
- use heap: O(nLog(k))

but it’s not scalable.

Hash table, multiple hosts

you can scale previous approach by using data paritioner. scalability and througput has been addressed.

But streaming data is not bounded. It has infinite data. what if we need to calculate top K for a day or a week?

count-min sketch, multiple hosts

There is well-known data structure called count-min sketch, kind of approximation algorithm, which guarantees fixed memory usage. basically it tradedoff between accuracy and memory.

count-min sketch uses 2 dimensional array, each row is different hash function, column is count. whenever new event comes in, calculate hash value for each row, and increment the value of the cell by 1. This means that it could have collision, that’s why we take the smallest value as a result.

But this doesn’t give us accurate top K lists. if constraints don’t allow inaccuracy, then we need a different approach.

Inertia

Top K problem

Requirements

single host approach

Hash table, multiple hosts

count-min sketch, multiple hosts

high level design

Reference

About

Categories

Tags

Tag Cloud

Archives

Recents