Tags: node.js, bittorrent, dht, torrent, webtorrent

How do DHT torrent indexing sites scrape infoHashes efficiently?


I am interested in how DHT torrent indexing sites work. I have a working infoHash scraper written with a Node.js library. At first I ran it behind NAT, but that was not efficient; after I moved it to a BSD server with a public IP, the results were much better. From many publications on this topic, I have learned that the best solution is to run several virtual DHT nodes to scrape infoHashes faster. My code starts several DHT node instances, each with a unique node ID and its own port.

My nodejs code:

"use strict"

const DHT = require('bittorrent-dht')
const crypto = require('crypto');

let DHTnodeID = []
for(let i = 1; i<=10; i++){
  DHTnodeID.push({[i]:crypto.createHash('sha1').update(`myDHTnodeLocal${i}`).digest('hex')}) //Give each node unique hash ID
}

let dhtOpt =  {
  nodeId: '',      // 160-bit DHT node ID (Buffer or hex string, default: randomly generated)
  //bootstrap: [],   // bootstrap servers (default: router.bittorrent.com:6881, router.utorrent.com:6881, dht.transmissionbt.com:6881)
  host: false,     // host of local peer, if specified then announces get added to local table (String, disabled by default)
  concurrency: 16, // k-rpc option to specify maximum concurrent UDP requests allowed (Number, 16 by default)
  //hash: Function,  // custom hash function to use (Function, SHA1 by default),
  //krpc: krpc(),     // optional k-rpc instance
  //timeBucketOutdated: 900000, // check buckets every 15min
  //maxAge: Infinity  // optional setting for announced peers to time out
}

var dhtNodes = []
for(let i = 1; i<=DHTnodeID.length; i++){
  dhtOpt.nodeId = DHTnodeID[i-1][String(i)]
  dhtNodes.push(new DHT(dhtOpt))
}

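// The handlers were not shown in the original post; these are minimal placeholders.
function listenFce() { console.log('DHT node listening') }
function readyFce() { console.log('DHT node ready') }
function announceFce(peer, infoHash) {
  // 'announce' fires when a remote peer announces itself under an infoHash
  console.log(`${infoHash.toString('hex')} announced by ${peer.host}:${peer.port}`)
}

let port = 6881 // run 10 DHT nodes on ports 6881-6890
for (const item of dhtNodes) {
  item.listen(port, listenFce)
  item.on('ready', readyFce)
  item.on('announce', announceFce)
  port++
}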

Then I found a university research project that contains the following statement:

The most obvious approach to increasing throughput is using several DHT nodes instead of one. Using several ports on a single IP address was not considered a viable option due to IP-address based filtering against potential DoS attacks. Instead the indexer is designed to run on several hosts or on a multihomed host. Individual instances synchronize their indexing activity through a shared relational database that stores discovered infohashes and the current processing stage for each .torrent file.

By Aaron Grunthal - University of Applied Sciences Esslingen

If the above statement is true, does it mean that my 10 DHT node instances will be considered a DoS attack, and can I be penalized somehow? If so, how do those websites (DHT torrent indexing sites) deal with this problem? Is there any way to run an efficient infoHash scraper with one public IP on one server? Obviously, the more instances I run, the more hashes I get, but the above statement worries me. Thank you very much in advance.


Solution

    If the above statement is true, does it mean that my 10 DHT node instances will be considered a DoS attack, and can I be penalized somehow?

    That depends on the implementation quality of the other nodes in the network. Advanced implementations apply various sanitizing strategies to keep their routing tables free of malicious peers. One such strategy is to allow only one routing table entry per IP address.
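To illustrate the one-entry-per-IP strategy, here is a minimal sketch of how a remote node might treat ten virtual nodes that share one address. The `RoutingTable` class is hypothetical and heavily simplified (a flat map instead of k-buckets); it is not taken from any particular client, and the IP address is a documentation placeholder.

```javascript
'use strict'

// Hypothetical, simplified routing table: a flat map instead of k-buckets.
class RoutingTable {
  constructor () {
    this.byIp = new Map() // IP address -> node entry
  }

  // Returns true if the node was stored, false if it was rejected.
  add (node) {
    const existing = this.byIp.get(node.host)
    if (existing && existing.id !== node.id) {
      return false // a different node ID from an IP we already track: reject
    }
    this.byIp.set(node.host, node)
    return true
  }

  get size () {
    return this.byIp.size
  }
}

const table = new RoutingTable()
const results = []
// Ten virtual nodes on one IP, each with its own ID and port (as in the question).
for (let i = 0; i < 10; i++) {
  results.push(table.add({ id: `node-${i}`, host: '203.0.113.7', port: 6881 + i }))
}
const accepted = results.filter(Boolean).length
console.log(`accepted ${accepted} of ${results.length} nodes, table size ${table.size}`)
```

Under this assumed policy only the first of the ten nodes is kept, so the extra instances gain nothing with peers that filter this way.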

    If so, how do those websites (DHT torrent indexing sites) deal with this problem?

    They may operate malicious nodes that try to get into more routing tables than a normal node would, but the sanitizing strategies mentioned above counter that, which makes it an unreliable strategy (and one that is harmful to the ecosystem). They can also operate from multiple IP addresses, as mentioned in your quote.

    Is there any way to run an efficient infoHash scraper with one public IP on one server?

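    BEP 51 (DHT Infohash Indexing) enables efficient indexing from a single host: its sample_infohashes query asks a remote node for a sample of the infoHashes it currently stores.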
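As a rough sketch of what a BEP 51 exchange involves, the snippet below builds the query dictionary and splits the `samples` field of a response into individual infoHashes. The helper names `buildSampleInfohashesQuery` and `parseSamples` are made up for illustration, the response is fabricated test data, and bencoding and sending the message over UDP are deliberately left out. No library API is assumed here.

```javascript
'use strict'

const crypto = require('crypto')

const HASH_LENGTH = 20 // SHA-1 infoHashes and node IDs are 20 bytes

// Build the (not yet bencoded) KRPC message for a BEP 51 query.
function buildSampleInfohashesQuery (nodeId, target) {
  return {
    t: crypto.randomBytes(2), // transaction ID
    y: 'q',
    q: 'sample_infohashes',
    a: { id: nodeId, target }
  }
}

// Split the `samples` field (concatenated 20-byte hashes) into hex strings.
function parseSamples (samples) {
  if (samples.length % HASH_LENGTH !== 0) {
    throw new Error('samples length must be a multiple of 20 bytes')
  }
  const hashes = []
  for (let i = 0; i < samples.length; i += HASH_LENGTH) {
    hashes.push(samples.subarray(i, i + HASH_LENGTH).toString('hex'))
  }
  return hashes
}

// Usage with made-up data standing in for a decoded response.
const query = buildSampleInfohashesQuery(crypto.randomBytes(20), crypto.randomBytes(20))
const fakeResponse = {
  interval: 300, // seconds to wait before asking this node again
  num: 2, // number of infoHashes the node reports having in storage
  samples: Buffer.concat([Buffer.alloc(20, 0xaa), Buffer.alloc(20, 0xbb)])
}
console.log(query.q, parseSamples(fakeResponse.samples))
```

A crawler built this way would walk the keyspace by varying `target` and would honor each node's `interval` before querying it again, which keeps a single host within normal protocol behavior.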