Search code examples
javaalgorithmgraph-algorithmpagerank

PageRank implementation for Research


After reading theory of PageRank algorithm from this site I would like to play with it. I am trying to implement this in Java. I mean I would like to play with PageRank in detail (like giving different weights and so on). For this I need to build hyperlink matrix. If I have 1 million nodes then my hyperlink matrix will be 1 million x 1 million size, which causes this exception:

Exception in thread "main" java.lang.OutOfMemoryError: Java heap space
    at WebGraph.main(WebGraph.java:6)

How can I implement PageRank in Java, is there any way of storing hyperlink matrix?


Solution

  • That is a great article to learn about pagerank. I implemented a Perl version from it here to use with Textrank. However, if you want to just learn about pagerank and how the various aspects discussed in the article affect the results (dampening factor, direct or undirected graph, etc.), I would recommend running experiments in R or Octave. If you want to learn how to implement it efficiently, then programming it up from scratch, as you are doing, is best.

    Most web graphs (or networks) are very sparse, which means most of the entries in the matrix representation of the graph are zero. A common data structure used to represent a sparse matrix is a hash-map, where the zero values are not stored. For example, if the matrix was

    1, 0, 0
    0, 0, 2,
    0, 3, 0
    

    a two dimension hash-map would store only the values for hm(0,0)=1, hm(1,2)=2, and hm(2,1)=3. So in a 1,000,000 by 1,000,000 matrix of a web graph, I would expect only a few million values to be non-zero. If each row averages only 5 non-zero values, a hash-map will use about 5*(8+8+8)10^6 bytes ~ 115mb to store it (8 for the left int index, 8 for the right int index, and 8 for the double value). The square matrix will use 810^6*10^6 ~ 7 terabytes.

    Implementing an efficient sparse matrix-vector multiply in Java is not trivial, and there are some already implemented if you don't want to devote time to that aspect of the algorithm. The sparse-matrix multiply is the most difficult aspect to implement of the pagerank algorithm, so after that it gets easier (and more interesting).