Search code examples
algorithmindexingp2pdistributeddht

How p2p search engines could prevent corruption of distributed index by malicious peers?


As a hobby I'm writing simple and primitive distributed web search engine and it occurred to me it currently has no protection against malicious peers trying to skew search results.

Current architecture of the project is storing inverse index and ranking factors in kad dht with peers updating this inverse index as they crawl web.

I've used google scholar in attempt to find some solution but it seems most of the authors of proposed p2p web search ignore above-mentioned problem.

I think I need some kind of reputation system or trust metrics, but my knowledge in this domain is sufficiently lacking and I would very much appreciate a few pointers.


Solution

  • One way you could avoid this is to only use reliable nodes for storing and retrieving values. The reliability of a node will have to be computed by known-good nodes, and it could be something like the similarity of a node's last few computed ranking factors compared to the same ranking factors computed by known-good nodes (i.e. compare the node's scores for google.com to known-good scores for google.com). Using this approach, you'll need to avoid the "rogue reliable node" problem (for example, by using random checks or reducing all reliability scores randomly).

    Another way you could approach this is to duplicate computation of ranking factors across multiple nodes, fetch all of the values at search time, and rank them on the client side (using variance, for example). You could also limit searches to sites that only have >10 duplicate values computed, so that there is some time before new sites are ranked. Additionally, any nodes with values outside of the normal range could be reported by the client in the background, and their reliability scores could be computed this way. This approach is time-consuming for the end user (unless you replicate known-good results to known-good nodes for faster lookups).

    Also, take a look at this paper which describes a sybil-proof weak-trust system (which, as the authors explain, is more robust than the impossible sybil-proof strong-trust system): http://www.eecs.harvard.edu/econcs/pubs/Seuken_aamas14.pdf