algorithm, graph, graph-theory, pagerank

How is PageRank calculated in a distributed way?


I understand the idea behind PageRank and have implemented it (while reading the book "Programming Collective Intelligence").

But I have read that it can be distributed across several servers (as I assume Google does). I'm a bit confused because, as I understand it, you need the entire graph to compute PageRank, since each page's rank is defined relative to the ranks of all the others.

I found the Wikipedia article, but it didn't explain much.

Any suggestions on how this is possible? Also, a bonus question: is the technique for distributed PageRank specific to PageRank, or can the same method be applied to other machine learning algorithms that run on graphs?


Solution

  • The state-of-the-art way of calculating PageRank is with Google's Pregel framework. I'm pretty sure they have something more sophisticated by now, but that is the latest published effort. (A minimal sketch of Pregel's vertex-centric model follows at the end of this answer.)

    You can read more details about it on the Google research blog, or read the published paper here.

    I'm working on an open-source implementation of the Bulk Synchronous Parallel (BSP) paradigm called Apache Hama. There is also Apache Giraph, which focuses solely on graph use cases, and plenty of others.

    As mfrankli mentioned, the MapReduce framework (Apache Hadoop, for example) can also be used to calculate PageRank, but it is not efficient for iterative algorithms (see the MapReduce-style sketch below).

    The noteworthy thing to add is that both solutions (MapReduce and BSP) are batch solutions, so they can be used to recalculate PageRank for the complete web graph. Since Google's index updates are much faster than such batch algorithms, you can expect that they frequently recalculate PageRank on subgraphs.
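
    To make the vertex-centric ("think like a vertex") idea concrete, here is a minimal single-machine Python sketch of how a BSP/Pregel-style system computes PageRank in supersteps. The graph, constants, and function names are purely illustrative and not any real framework's API; in Giraph or Hama the vertices would be partitioned across workers and the messages would travel over the network, so no single worker ever needs the whole graph in memory.

        # Minimal single-machine simulation of the Pregel/BSP "think like a
        # vertex" model for PageRank. All names here are hypothetical; a real
        # system partitions the vertices across workers and delivers the
        # messages over the network between supersteps.

        DAMPING = 0.85
        SUPERSTEPS = 30

        # adjacency list: vertex -> list of out-neighbours (toy graph,
        # no dangling nodes, so dangling-node handling is omitted)
        graph = {
            "A": ["B", "C"],
            "B": ["C"],
            "C": ["A"],
            "D": ["C"],
        }


        def pagerank_bsp(graph, supersteps=SUPERSTEPS, damping=DAMPING):
            n = len(graph)
            # each vertex only knows its own rank and its outgoing edges
            rank = {v: 1.0 / n for v in graph}

            for _ in range(supersteps):
                # superstep, part 1: every vertex sends rank/out_degree
                # to each of its neighbours
                outbox = {v: [] for v in graph}
                for v, neighbours in graph.items():
                    if neighbours:
                        share = rank[v] / len(neighbours)
                        for u in neighbours:
                            outbox[u].append(share)
                # barrier: messages are delivered, then part 2 runs on
                # every vertex using only the messages addressed to it
                for v in graph:
                    rank[v] = (1.0 - damping) / n + damping * sum(outbox[v])
            return rank


        print(pagerank_bsp(graph))

    The key point is that each vertex only touches its own rank, its out-edges, and the messages addressed to it; the global ranking emerges from repeated local exchanges separated by synchronization barriers.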
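
    For comparison, here is a hypothetical sketch of a single PageRank iteration expressed as MapReduce, with an in-memory dictionary standing in for Hadoop's shuffle and for the HDFS files a real job would read and write. It illustrates why plain MapReduce is awkward for iterative algorithms: every iteration must re-emit the whole graph structure and run a full map/shuffle/reduce pass.

        # Sketch of one PageRank iteration phrased as MapReduce. Function
        # names and the in-memory "shuffle" are simplifications, not the
        # Hadoop API.

        from collections import defaultdict

        DAMPING = 0.85


        def map_phase(vertex, rank, out_links):
            # re-emit the graph structure so the next iteration can see it,
            # plus one rank contribution per out-link
            yield vertex, ("links", out_links)
            for target in out_links:
                yield target, ("rank", rank / len(out_links))


        def reduce_phase(vertex, values, num_vertices):
            out_links, incoming = [], 0.0
            for kind, payload in values:
                if kind == "links":
                    out_links = payload
                else:
                    incoming += payload
            new_rank = (1.0 - DAMPING) / num_vertices + DAMPING * incoming
            return vertex, new_rank, out_links


        def one_iteration(state):
            # state: vertex -> (rank, out_links); this stands in for the
            # files a real Hadoop job would re-read on every iteration
            shuffled = defaultdict(list)
            for v, (rank, links) in state.items():
                for key, value in map_phase(v, rank, links):
                    shuffled[key].append(value)
            n = len(state)
            return {v: (r, l) for v, r, l in
                    (reduce_phase(k, vals, n) for k, vals in shuffled.items())}


        state = {"A": (0.25, ["B", "C"]), "B": (0.25, ["C"]),
                 "C": (0.25, ["A"]), "D": (0.25, ["C"])}
        for _ in range(30):
            state = one_iteration(state)
        print({v: round(r, 3) for v, (r, _) in state.items()})

    The per-iteration job setup and the repeated reading and writing of the entire graph are exactly the overhead that BSP-style systems like Pregel, Hama, and Giraph avoid by keeping the graph loaded across supersteps.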