network-programming p2p bittorrent dht kademlia

How does Kademlia protocol guarantee peers forming a connected graph?

Nodes: Clients on DHT-network.
Peers: Clients trying to download a specific resource.

Suppose that the DHT-network is a connected graph, but NO nodes can access ALL other nodes (a consumption contrary to the common belief that the Internet, which DHT-network overlays on, is fully connected).

Is the Peer-network, which overlays on DHT-network, is still a connected graph? Why?

Solution

Kademlia is an abstract algorithm that assumes spherical cows in a vacuum. The only failure modes the paper discusses are churn and temporary graph partitions. Asymmetric reachability is not considered.

Kademlia as implemented in the real world makes no guarantees. Everything is done on a best-effort probabilities-are-good-enough basis.

The main concern in the real world are not nodes where interconnected cluster A cannot talk to a interconnected cluster B. NATs and firewalls do not introduce such clusters on a considerable scale. They create a set of second-class citizens which are not consistently reachable by anyone - absent NAT traversal measures - and thus can only connect to the first-class citizens which are the nodes where anyone can talk to anyone else. Of course a few edge cases exist, but they're largely irrelevant.

Anyway, since you're not even asking about kademlia but about bittorrent, which is not really an overlay over kademlia but a separate network which simply bootstraps its contact information from kademlia things get even more complicated. Bittorrent can be implemented over two different transport mechanisms, TCP and µTP, and clients may support different levels of nat traversal capabilities for TCP, µTP and Kademlia-via-UDP.

Kademlia nodes generally store contact information for bittorrent on several reachable nodes, since they - quite obviously - cannot reach unreachable nodes for the purpose of storage. They also do so with redundancy, which ensures a high likelihood that the stored contact information can be observed by anyone else.

Based on that contact information bittorrent clients can then attempt to connect to each other. As long as there are some reachable bittorrent clients they will be able to establish a direct connection and then may additionally be able to attempt some nat traversal measures between non-reachable nodes. Again, there are no guarantees, so small swarms may fail under some circumstances, but once a swarm becomes large enough the probabilities tip overwhelmingly in the favor of the graph becoming connected.

A small additional concern is IPv4 vs. IPv6. Generally IPv6 provides better connectivity (if firewalls don't get in the way) but not all clients implement the ipv6 extensions equally well, thus possibly preventing a few v6-edges from forming when they would in principle provide superior connectivity between the same nodes.

Note that ipv4 and ipv6 DHTs are in theory independent DHT networks, they just happen to have some significant overlap. It's basically outside the scope of kademlia how to coordinate multiple independent networks.