Distributed hash table
A distributed hash table (DHT) is a class of decentralized distributed system that provides a lookup service similar to a hash table: (key, value) pairs are stored in the DHT, and any participating node can efficiently retrieve the value associated with a given key.
DHTs form an infrastructure that can be used to build more complex services, such as anycast, cooperative Web caching, distributed file systems, domain name services, instant messaging, multicast, and also peer-to-peer file sharing and content distribution systems.
Unlike unstructured P2P systems, a DHT maintains a tight coupling between nodes and file locations: a request for a piece of content is routed directly to the node responsible for it, instead of being flooded through the network.
DHT has the following properties:
Autonomy and Decentralization: the nodes collectively form the system without any central coordination.
Fault tolerance: the system should be reliable (in some sense) even with nodes continuously joining, leaving, and failing.
Scalability: the system should function efficiently even with thousands or millions of nodes.
Building a DHT
- A hash function maps a file to a unique ID, e.g. hash(“Harry Potter”) -> 3912.
- The ID range space is partitioned and distributed among all nodes in the network.
- The designated node stores the location of the file (this is the indirect approach: the node holds a pointer to the file rather than the file itself).
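The steps above can be sketched as follows. The M-bit ID space, the node IDs, and the successor rule for assigning ranges are illustrative assumptions; SHA-1 stands in for the toy hash in the example.

```python
import hashlib

M = 12  # illustrative: IDs live in [0, 2^M)

def file_id(name: str) -> int:
    # Hash a file name to an M-bit ID (SHA-1 truncated to M bits; any
    # uniform hash works -- the toy example maps "Harry Potter" to 3912).
    digest = hashlib.sha1(name.encode()).digest()
    return int.from_bytes(digest, "big") % (2 ** M)

def responsible_node(key: int, node_ids: list[int]) -> int:
    # Successor rule for distributing the range space: each node owns
    # the keys from its predecessor's ID (exclusive) up to its own ID
    # (inclusive); keys past the largest node wrap to the smallest.
    for n in sorted(node_ids):
        if key <= n:
            return n
    return min(node_ids)

nodes = [500, 1500, 2500, 3500]
print(responsible_node(3912, nodes))  # 3912 > 3500, wraps to node 500
```

A search query for the file then goes straight to `responsible_node(file_id(name), nodes)` rather than flooding.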
Search in DHT
- A search query is routed to the node whose range covers the file's ID.
- Each node retains routing information that is maintained in a fully distributed manner (i.e. no central point, no single point of failure).
There are different hashing and routing techniques associated with DHTs. The most important are consistent hashing and Chord routing.
Consistent hashing is a special kind of hashing such that when the hash table is resized, only K/n keys need to be remapped on average, where K is the number of keys and n is the number of slots.
In most traditional hash tables, a change in the number of array slots causes nearly all keys to be remapped. Concretely, with N cache servers and the naive mapping hash(object) % N, the three cases below all force a near-total remapping:
leaves/failures - a cache server m goes down (a case that must be handled in practice), so every object mapped to cache m becomes invalid. Cache m has to be removed from the pool, leaving N-1 servers, and the mapping formula becomes hash(object) % (N-1);
join - as traffic grows, a new cache server is added, making N+1 servers, and the mapping formula becomes hash(object) % (N+1);
scalability - as hardware gets more powerful, you may want newly added nodes to take on a larger share of the load, which the simple hash formula above also cannot express.
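A quick sketch of how badly modulo hashing behaves (the counts below assume 1000 keys and 4 servers, both chosen arbitrarily):

```python
# With hash(object) % N, adding one server (N = 4 -> N = 5) remaps
# almost every key: a key keeps its server only when k % 4 == k % 5,
# which holds for just 4 residues out of every 20.
keys = range(1000)
moved = sum(1 for k in keys if k % 4 != k % 5)
print(f"{moved} of 1000 keys remapped")  # 800 of 1000
```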
Consistent hashing is based on mapping each object to a point on the edge of a circle. The system maps each available machine to pseudo-randomly distributed points on the edge of the same circle. To find an object's machine, walk clockwise from the object's point until you meet the first machine's point.
For a very detailed walkthrough of consistent hashing, read this Chinese blog.
In this way, consistent hashing avoids major changes to the key-to-node mapping when nodes join or leave.
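A minimal consistent-hash ring illustrating this property; the class name, server names, and the use of SHA-1 for point placement are all illustrative assumptions.

```python
import bisect
import hashlib

class HashRing:
    """Map each key to the first server clockwise from the key's point."""

    def __init__(self, servers):
        # One point per server on the circle, kept sorted for bisection.
        self._points = sorted((self._point(s), s) for s in servers)

    @staticmethod
    def _point(s: str) -> int:
        return int.from_bytes(hashlib.sha1(s.encode()).digest(), "big")

    def lookup(self, key: str) -> str:
        # First server point clockwise from the key's point, wrapping.
        i = bisect.bisect(self._points, (self._point(key), ""))
        return self._points[i % len(self._points)][1]

servers = ["cache-a", "cache-b", "cache-c"]
ring = HashRing(servers)
owner = ring.lookup("Harry Potter")

# Removing a server *other than* the owner does not remap the key --
# the property that hash(object) % N lacks.
other = next(s for s in servers if s != owner)
smaller = HashRing([s for s in servers if s != other])
assert smaller.lookup("Harry Potter") == owner
```

Production rings typically also place several virtual points per server to smooth out the load; that refinement is omitted here.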
With simple routing, a query is forwarded from each node to its successor until it reaches the target, so every node keeps only O(1) routing information but query routing takes O(n) time. At the other extreme, every node stores the ID and IP of all other nodes; query routing then takes O(1) hops, but the routing information per node is O(n).
We’ll now discuss Chord Routing.
Each node stores more information about nodes that closely follow it on the identifier circle than about nodes farther away. Specifically, node n keeps pointers to the successors of the IDs at distances 1, 2, 4, 8, 16, 32, … from n (each entry is called a finger).
Routing information maintained at each node: O(log N).
Query routing takes O(log N) time.
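A sketch of Chord lookup on a small ring (M = 6 bits and ten made-up node IDs); `find_successor` follows the standard Chord rule of jumping to the closest preceding finger of the key each hop.

```python
M = 6                                            # circle [0, 2^M) = [0, 64)
NODES = [1, 8, 14, 21, 32, 38, 42, 48, 51, 56]   # illustrative node IDs

def successor_of(k: int) -> int:
    # First node clockwise from k (global knowledge, used as ground truth).
    k %= 2 ** M
    return min((n for n in NODES if n >= k), default=min(NODES))

def finger(n: int, i: int) -> int:
    # i-th finger of node n: the successor of n + 2^i on the circle.
    return successor_of((n + 2 ** i) % 2 ** M)

def _in(x, a, b, closed=True):
    # x in (a, b] (closed) or (a, b) (open) on the circle, with wraparound.
    if a < b:
        return a < x <= b if closed else a < x < b
    return x > a or (x <= b if closed else x < b)

def find_successor(n: int, key: int):
    hops = 0
    while True:
        succ = successor_of((n + 1) % 2 ** M)
        if _in(key, n, succ):            # key in (n, succ] -> succ owns it
            return succ, hops
        for i in reversed(range(M)):     # closest preceding finger of key
            if _in(finger(n, i), n, key, closed=False):
                n = finger(n, i)
                break
        hops += 1

print(find_successor(1, 54))  # routes 1 -> 38 -> 48 -> 51, answer (56, 3)
```

Each hop roughly halves the remaining clockwise distance to the key, which is where the O(log N) hop count comes from.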
Join and leave in Chord
It’s very much like insertion and removal in a doubly linked list. Read up on it yourself.
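The analogy can be sketched on a simplified ring where each node keeps only `pred`/`succ` pointers; real Chord finds the join point with `find_successor` instead of a linear walk and repairs fingers via stabilization, and all names here are illustrative.

```python
class Node:
    def __init__(self, nid: int):
        self.id = nid
        self.pred = self.succ = self   # a lone node points to itself

def _owns(x: int, a: int, b: int) -> bool:
    # x in (a, b] on the identifier circle, with wraparound
    return a < x <= b if a < b else x > a or x <= b

def join(new: "Node", entry: "Node") -> None:
    # Walk to the node whose range covers new.id, then splice the new
    # node in front of it -- exactly like doubly-linked-list insertion.
    s = entry
    while not _owns(new.id, s.pred.id, s.id):
        s = s.succ
    new.pred, new.succ = s.pred, s
    s.pred.succ = new
    s.pred = new

def leave(node: "Node") -> None:
    # Unlink, like doubly-linked-list removal; the successor inherits
    # the departing node's keys (key transfer omitted here).
    node.pred.succ = node.succ
    node.succ.pred = node.pred

a = Node(10)
join(Node(40), a)
join(Node(25), a)
print([a.id, a.succ.id, a.succ.succ.id])  # [10, 25, 40]
```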
Special thanks to the online resources written by some CSDN bloggers.