- Do not write anything unrelated to CS.
- Do not write too long - 1 or 2 pages are fine. Senior engineer 3 pages.
- Do not write low GPA
- Never ever write “proficient in anything”
Most classic question is “Frequent items” (refer to July’s blog).
Find top k hot queries in a daily access log of Google.
- k = 1 vs k = 100000 - majority numbers
- low RAM vs sufficient RAM
- single machine vs multiple machines
- accurate vs inaccurate
- HashTable + Heap (min-heap)
- Time O(n * logk), Space O(n)
- Split into 1000 (i.e. LOG/M) files by hash(query) % 1000
- Using HashTable + Heap to get top k for each files
- Collect 1000 top k queries and get global top k
- This method requires a lot of disk access and r/w, still slow.
Inaccurate (reduce memory from O(n) to O(k))
- Hash Count (only need to know this one) Limit the size of HashMap. The bigger the RAM, the more accurate is the result.
- Space Saving
- Lossy Counting
- Sticky Sampling
- Count Sketch
- Regular bloom filter - use 4 线性无关 formula
- Counting bloom filter - support delete
- Better DS than HashMap, but can loose some accuracy
Find all unique queries - use bigmap to store 3 types of states
Design a short url system
to store hot urls
- Load Balance
Too many click in short time
- Storage balance
Hash value of an url and then store in individual machine
- Consistent Hash
Node, can increase # of machines to store information
check which machine response my query
what is router is down?
url frequently access by China, then put the url storage in Beijing
Need-to-know Design patterns
- Master-slave (esp. for relational DB)