Search code examples
p2pkademlia

What does it mean by Kademlia keys are used to identify nodes as well as data?


Okay, I've been reading articles and the paper about Kademlia recently to implement a simple p2p program that uses kademlia dht algorithm. And those papers are saying, those 160-bit key in a Kademlia Node is used to identify both nodes (Node ID) and the data (which are stored in a form of tuple).

I'm quite confused on that 'both' part.

As far as my understanding goes, each node in a Kademlia binary tree uniquely represents a client(IP, port) who each holds a list of files.

Here is the general flow on my understanding.

  1. Client (.exe) gets booted
  2. Creates a node component
  3. Newly created node joins the network (bootstrapping)
  4. Sends find_node(filehash) to k-closest nodes
    • Let's say hash is generated by hashing file binary named file1.txt
  5. Received nodes each finds the queried filehash in its different hash table
    • Say, a hash map that has a list of files(File Hash, file location)
  6. Step 4,5 repeated until the node is found (meanwhile all associated nodes are updating the buckets)

Does this flow look all right?

Additionally, bootstrapping method of Kademlia too confuses me. When the node gets created (user executes the program), It seems like it uses bootstrapping node to fill up the buckets. But then what's bootstrapping node? Is it another process that's always running? What if the bootstrapping node gets turned off?

Can someone help me better understand the concept?

Thanks for the help in advance.


Solution

  • Does this flow look all right?

    It seems roughly correct, but your wording is not very precise.

    Each node has a routing table by which it organizes the neighbors it knows about and another table in which it organizes the data it is asked to store by others. Nodes have a quasi-random ID that determines their position in the routing keyspace. The hashes of keys for stored data don't precisely match any particular node ID, so the data is stored on the nodes whose ID is closest to the hash, as determined by the distance metric. That's how node IDs and key hashes are used for both.

    When you perform a lookup for data (i.e. find_value) you ask the remote nodes for the k-closest neighbor set they have in their routing table, which will allow you to home in on the k-closest set for a particular target key. The same query also asks the remote node to return any data they have matching that target ID.

    When you perform a find_node on the other hand you're only asking them for the closest neighbors but not for data. This is primarily used for routing table maintenance where you're not looking for any data.

    Those are the abstract operations, if needed an actual implementation could separate the lookup from the data retrieval, i.e. first perform a find_node and then use the result set to perform one or more separate get operations that don't involve additional neighbor lookups (similar to the store operation).

    Since kademlia is UDP-based you can't really serve arbitrary files because those could easily exceed reasonable UDP packet sizes. So in practice kademlia usually just serves as a hash table for small binary values (e.g. contact information, public keys and such). Bulk operations are either performed by other protocols bootstrapped off those values or by additional operations beyond those mentioned in the kademlia paper.

    What the paper describes is only the basic functionality for a routing algorithm and most basic key value storage. It is a spherical cow in a vacuum. Actual implementations usually need additional features or work around security and reliability problems faced on the public internet.

    But then what's bootstrapping node? Is it another process that's always running? What if the bootstrapping node gets turned off?

    That's covered in this question (by example of the bittorrent DHT)