python · numpy · scipy · sparse-matrix

Sparse matrices for matrix-vector products in PageRank computation


I'm implementing some iterative algorithms for calculating the PageRank of a webgraph, and I'm having some trouble finding the best way to store some matrices in memory.

I have an n x n matrix B, which represents the webgraph (B[i,j] = 1/outdegree[j] if there's an arc from j to i, 0 otherwise; outdegree[j] is the number of outgoing arcs from node j) and which I'm storing as a scipy.sparse.dok_matrix, since of course it has mostly 0 entries. The problem is that I have to compute many matrix x vector products of the type Px, where

P = B + (1/n)*e*d^T

where e is the all-ones vector and d is a boolean vector that has 1 in component j if outdegree[j] = 0 (i.e. j is a dangling node; this is what makes the columns of P stochastic). Basically e*d^T is sort of a linear algebra "trick" to write an n x n matrix whose columns are either all 1s or all 0s, depending on whether the corresponding entry in d is 1 or 0.

So I'm struggling with two, not totally independent, things:

  1. How do I achieve the same "trick" in numpy, since with 1-D arrays e.dot(d.T) simply computes the scalar product, while I want a matrix? I guess it's some clever use of broadcasting and slicing, but I'm still new to numpy and can't figure it out.
  2. If I simply define P as above (supposing I've found a solution to 1.), I lose the memory advantage I gained by storing B as a sparse matrix, and suddenly I need to store n^2 floats. And anyway, the matrix I'm adding to B is very redundant (there are only two types of columns), so there must be a better way than storing the whole matrix in memory. Any suggestions? Keep in mind that it has to be stored in a way that easily allows computation of P.dot(x), for x an arbitrary vector.
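For point 1 on its own, the e*d^T matrix can be formed explicitly with np.outer or with broadcasting. A small sketch (n and d are made-up values, just to illustrate the shapes):

```python
import numpy as np

# Toy values, assumed purely for illustration: n = 4, two nonzero entries in d.
n = 4
e = np.ones(n)
d = np.array([1.0, 0.0, 1.0, 0.0])

# e*d^T as an actual n x n matrix: np.outer, or equivalently broadcasting
# a column vector (n, 1) against a row vector (1, n).
M = np.outer(e, d)
M_broadcast = e[:, None] * d[None, :]

# Column j of M is all 1s where d[j] == 1 and all 0s where d[j] == 0.
print(M)
```

Note that actually materializing this matrix costs n^2 floats, which is exactly the memory problem raised in point 2.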

Solution

  • For simplicity, since expressions with np.dot get bulky, let ∙ denote matrix multiplication, let e, d and x be column vectors, i.e. of shape (n, 1), and in the expressions with square brackets let * be Python's list repetition. Then, by associativity

    (e∙d.T)∙x = e∙(d.T∙x) = [[d.T∙x] * n]
    

    where d.T∙x is a scalar, and

    P∙x = B∙x + 1/n * e∙d.T∙x = B∙x + 1/n * [[d.T∙x] * n]
    

    so to carry out your computation you only need to store the vector d. Note that d.T∙x (or equivalently np.dot(d.T, x) if arrays are used) is an inner product of two vectors, which is a cheap operation relative to a matrix multiplication.
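    Putting it together, a minimal matrix-free sketch (the tiny graph below is made up just to have something concrete; B is converted to CSR, which supports much faster matrix-vector products than DOK):

    ```python
    import numpy as np
    import scipy.sparse as sp

    def pagerank_matvec(B, d, x):
        """Compute P @ x = B @ x + (1/n) * e * (d.T @ x) without ever forming P."""
        n = B.shape[0]
        s = d @ x             # the scalar d.T * x
        return B @ x + s / n  # adding a scalar broadcasts (1/n)*s over all n components

    # Assumed toy graph: arcs 0->1, 0->2, 1->2; node 2 has no outgoing arcs.
    B = sp.dok_matrix((3, 3))
    B[1, 0] = 0.5             # outdegree[0] == 2
    B[2, 0] = 0.5
    B[2, 1] = 1.0             # outdegree[1] == 1
    B = B.tocsr()             # CSR matvec is much faster than DOK

    d = np.array([0.0, 0.0, 1.0])  # example d vector for this graph
    x = np.array([1.0, 2.0, 3.0])

    print(pagerank_matvec(B, d, x))  # same result as building P densely and doing P @ x
    ```

    This keeps memory at O(nnz(B) + n) instead of O(n^2), and each product costs one sparse matvec plus one inner product.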