I'm wondering if the relatively slow performance I'm experiencing with JUNG2 and PageRank is reasonable, considering the data I'm using.
Given a graph with ~200 vertices and ~40,000 edges, it takes about 15 seconds to compute PageRank. Given a graph with ~900 vertices and ~800,000 edges, it takes nearly 5 minutes.
I realize these graphs are very dense, and maybe that is naturally expected to take a while to calculate. However, these graphs are not particularly big, and at this rate it would be impossible to calculate anything but a small subset of a graph at a time (approx. 8 hours for 10,000 vertices and 100,000,000 edges).
I'm using a DirectedSparseGraph with ints (autoboxed to Integer) as the vertices and edges, and no edge weights. These are obviously NOT sparse graphs, so perhaps there is a more appropriate graph implementation that would be better suited to dense graphs?
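For concreteness, my setup is essentially the following (a simplified sketch rather than my actual code: the complete-graph wiring is just a stand-in for my data, the class name is made up, and I'm calling the stock edu.uci.ics.jung.algorithms.scoring.PageRank with an alpha of 0.15):

    import edu.uci.ics.jung.algorithms.scoring.PageRank;
    import edu.uci.ics.jung.graph.DirectedSparseGraph;

    public class PageRankTiming {
        public static void main(String[] args) {
            // Dense directed graph with Integer vertices and Integer edge ids, no weights.
            DirectedSparseGraph<Integer, Integer> graph = new DirectedSparseGraph<Integer, Integer>();
            int numVertices = 200;

            for (int v = 0; v < numVertices; v++) {
                graph.addVertex(v);
            }

            // Wire up every ordered pair -> ~40,000 edges, i.e. a dense graph.
            int edgeId = 0;
            for (int from = 0; from < numVertices; from++) {
                for (int to = 0; to < numVertices; to++) {
                    if (from != to) {
                        graph.addEdge(edgeId++, from, to);
                    }
                }
            }

            // Unweighted PageRank; 0.15 is the random-jump (teleport) probability.
            PageRank<Integer, Integer> ranker = new PageRank<Integer, Integer>(graph, 0.15);

            long start = System.currentTimeMillis();
            ranker.evaluate();
            System.out.println("PageRank took " + (System.currentTimeMillis() - start) + " ms");
            System.out.println("Score of vertex 0: " + ranker.getVertexScore(0));
        }
    }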
I'm running the jvm in server mode with tons of heap space, and have confirmed that the entire graph fits into memory and no swap is used.
The results of the ranking are correct: they sum to 1 and match the test examples in the JUNG source and elsewhere.
Maybe there is a much higher-performance approach and/or library for this? But even if it were 10 times faster, doesn't the fact that the running time appears to be proportional to the number of edges mean that I'm quickly going to exceed the limits of whatever approach I'm using?
For example, I've considered skipping the graph approach altogether and using a matrix to represent transition probabilities, which would probably provide a significant boost. But PageRank isn't the only algorithm I'm interested in, and if I start rolling my own approaches, that could solve one problem and introduce several others.
Also, I'm on a slow box (dual-core 1 GHz), so your numbers would probably be a lot better (say 4-5 seconds for my first example), but the point holds. My main concern is whether the code is expected to run order(s) of magnitude faster than this, or expected to scale logarithmically.
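Roughly what I have in mind is a plain power iteration over a dense transition matrix, something like the sketch below (my own rough sketch, not anything from JUNG; it assumes a row-stochastic matrix with dangling nodes already handled, and the class name, 0.85 damping factor, and tolerance are just placeholders):

    import java.util.Arrays;

    /** Rough sketch: PageRank as power iteration over a dense transition matrix. */
    public class DensePageRank {

        /**
         * transition[i][j] is the probability of moving from page i to page j,
         * so each row is assumed to sum to 1 (dangling nodes handled beforehand).
         */
        static double[] pageRank(double[][] transition, double damping, int maxIterations, double tolerance) {
            int n = transition.length;
            double[] rank = new double[n];
            Arrays.fill(rank, 1.0 / n);                    // start from the uniform distribution

            for (int iter = 0; iter < maxIterations; iter++) {
                double[] next = new double[n];
                Arrays.fill(next, (1.0 - damping) / n);    // teleportation term

                // next[j] += damping * sum_i rank[i] * transition[i][j]  -- O(n^2) per iteration
                for (int i = 0; i < n; i++) {
                    for (int j = 0; j < n; j++) {
                        next[j] += damping * rank[i] * transition[i][j];
                    }
                }

                double delta = 0.0;                        // L1 change, used as the convergence test
                for (int j = 0; j < n; j++) {
                    delta += Math.abs(next[j] - rank[j]);
                }
                rank = next;
                if (delta < tolerance) {
                    break;
                }
            }
            return rank;
        }

        public static void main(String[] args) {
            // Tiny 3-node cycle 0 -> 1 -> 2 -> 0, so all ranks should converge to ~1/3.
            double[][] transition = {
                { 0, 1, 0 },
                { 0, 0, 1 },
                { 1, 0, 0 },
            };
            System.out.println(Arrays.toString(pageRank(transition, 0.85, 100, 1e-9)));
        }
    }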
Anyway, thanks for your insights.
tl;dr: this is an inherent limitation of the PageRank algorithm, asymptotically speaking.
Any algorithm that wants to take into account all the edges is going to have time complexity that's Omega(|E|). And any algorithm that is trying to measure global topological properties (e.g. eigenvector measures such as PageRank) is going to have to look at each edge at least once.
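To make that concrete, here is a rough sketch of a single power-iteration pass over an adjacency-list representation (not JUNG's actual implementation): every edge is visited exactly once per pass, so each iteration costs Theta(|V| + |E|), and the total cost is that times the number of iterations needed to converge.

    import java.util.Arrays;
    import java.util.List;

    /** Sketch of one PageRank pass over an adjacency list (not JUNG's code). */
    class OneIteration {
        /** outEdges.get(v) lists the targets of v's outgoing edges. */
        static double[] step(List<List<Integer>> outEdges, double[] rank, double damping) {
            int n = rank.length;
            double[] next = new double[n];
            Arrays.fill(next, (1.0 - damping) / n);          // teleportation term

            for (int v = 0; v < n; v++) {                    // every vertex...
                List<Integer> targets = outEdges.get(v);
                if (targets.isEmpty()) continue;             // dangling nodes ignored in this sketch
                double share = damping * rank[v] / targets.size();
                for (int w : targets) {                      // ...and every edge is touched exactly once
                    next[w] += share;
                }
            }
            return next;                                     // Theta(|V| + |E|) work per pass
        }
    }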
You may be able to improve the performance by some constant by changing the Graph implementation, but ultimately doing stuff on dense graphs tends to be expensive.
Framing question: what are the semantics of your graph and what are you hoping to learn by calculating PageRank? What problem are you trying to solve?