python, performance, mongodb, scipy, sparse-matrix

MongoDB to Python sparse matrix: how to make it faster?


I have n documents in MongoDB, each containing a scipy sparse vector stored as a pickle object and initially created with scipy.sparse.lil_matrix. The vectors all have the same size, say p x 1.

What I need to do is put all these vectors into a sparse n x p matrix back in Python. I am using mongoengine, so I defined a property to load each pickled vector:

import cPickle

from mongoengine import Document, StringField

class MyClass(Document):

    vector_text = StringField()

    @property
    def vector(self):
        # Unpickle the stored sparse vector on every access.
        return cPickle.loads(self.vector_text)
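
For reference, the vectors are stored by pickling them into vector_text, roughly like this (the p x 1 shape is as described above; the exact creation code and values are illustrative):

import cPickle
import scipy.sparse

vec = scipy.sparse.lil_matrix((67, 1))  # a p x 1 LIL vector, p = 67
vec[3, 0] = 1.0
# In Python 2, cPickle.dumps returns a str, which fits a StringField.
MyClass(vector_text=cPickle.dumps(vec)).save()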

Here's what I'm doing now, with n = 4700 and p = 67:

import scipy.sparse

items = MyClass.objects()
M = items[0].vector
for item in items[1:]:
    to_add = item.vector
    # Each hstack call rebuilds the whole matrix from scratch.
    M = scipy.sparse.hstack((M, to_add))

The loading part (i.e. calling the property n times) takes about 1.3 s, and the stacking part about 2.7 s. Since n is going to increase substantially in the future (possibly to a few hundred thousand or more), I sense that this is not optimal :) Any idea how to speed the whole thing up? If you know how to speed up only the "loading" or only the "stacking", I'm happy to hear it. For instance, maybe the solution is to store the entire matrix in MongoDB? Thanks!


Solution

  • First, what you describe would require vstack, not hstack, since you want to stack n row vectors into an n x p matrix. In any case, your choice of sparse format is part of your performance problem. Try the following:

    import scipy.sparse as sps

    n, p = 4700, 67
    csr_vecs = [sps.rand(1, p, density=0.5, format='csr') for j in xrange(n)]
    lil_vecs = [vec.tolil() for vec in csr_vecs]
    
    %timeit sps.vstack(csr_vecs, format='csr')
    1 loops, best of 3: 722 ms per loop
    
    %timeit sps.vstack(lil_vecs, format='lil')
    1 loops, best of 3: 1.34 s per loop
    

    So there's already a 2x improvement simply from switching to CSR. Furthermore, the stacking functions in scipy.sparse do not seem to be very optimized, certainly not for sparse vectors. The following two functions stack a list of CSR or LIL row vectors, returning a CSR sparse matrix:

    import itertools as it
    import numpy as np
    import scipy.sparse as sps

    def csr_stack(vectors):
        # Concatenate the raw CSR buffers; indptr is the running nnz count.
        data = np.concatenate([vec.data for vec in vectors])
        indices = np.concatenate([vec.indices for vec in vectors])
        indptr = np.cumsum([0] + [vec.nnz for vec in vectors])
        return sps.csr_matrix((data, indices, indptr),
                              shape=(len(vectors), vectors[0].shape[1]))

    def lil_stack(vectors):
        # Same idea for LIL row vectors: chain each row's value and
        # column-index lists directly into CSR buffers.
        indptr = np.cumsum([0] + [vec.nnz for vec in vectors])
        data = np.fromiter(it.chain(*(vec.data[0] for vec in vectors)),
                           dtype=vectors[0].dtype, count=indptr[-1])
        indices = np.fromiter(it.chain(*(vec.rows[0] for vec in vectors)),
                              dtype=np.intp, count=indptr[-1])
        return sps.csr_matrix((data, indices, indptr),
                              shape=(len(vectors), vectors[0].shape[1]))
    

    It works:

    >>> np.allclose(sps.vstack(csr_vecs).A, csr_stack(csr_vecs).A)
    True
    >>> np.allclose(csr_stack(csr_vecs).A, lil_stack(lil_vecs).A)
    True
    

    And is substantially faster:

    %timeit csr_stack(csr_vecs)
    100 loops, best of 3: 11.7 ms per loop
    
    %timeit lil_stack(lil_vecs)
    10 loops, best of 3: 37.6 ms per loop
    
    %timeit lil_stack(lil_vecs).tolil()
    10 loops, best of 3: 53.6 ms per loop
    

    So, by switching to CSR throughout, you can improve performance by over 100x compared with your current LIL-based stacking (1.34 s down to 11.7 ms). If you stick with LIL vectors, the improvement is around 35x when you can live with CSR for the combined matrix, and about 25x if you insist on LIL there as well.
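
    Putting it together with the loading loop from the question, a sketch might look like this (it assumes you keep the vector property, and that the stored vectors are 1 x p rows as in the timings above; transpose them first if they are p x 1):

    items = MyClass.objects()
    # Load everything once, converting each unpickled vector to CSR.
    vectors = [item.vector.tocsr() for item in items]
    M = csr_stack(vectors)  # n x p CSR matrix

    Better yet, pickle the vectors as CSR in the first place, so no conversion is needed at load time.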