Search code examples
pythonmatrixnumpysparse-matrixpytables

Efficiently store a large sparse matrix (float)


I am looking for a solution to store about 10 million floating point (double precision) numbers of a sparse matrix. The matrix is actually a two-dimensional triangular matrix consisting of 1 million by 1 million elements. The element (i,j) is the actual score measure score(i,j) between the element i and element j. The storage method must allow very fast access to this information maybe by memory mapping the file containing the matrix. I certainly don't want to load all the file in memory.

class Score(IsDescription):
    grid_i = UInt32Col()
    grid_j = UInt32Col()
    score  = FloatCol()

I've tried pytables by using the Score class as exposed, but I cannot access directly to the element i,j without scanning all the rows. Any suggestion?


Solution

  • 10 million double precision floats take up 80 MB of memory. If you store them in a 1 million x 1 million sparse matrix, in CSR or CSC formats, you will need an additional 11 million int32s, for a total of around 125 MB. That's probably less than 7% of the physical memory in your system. And in my experience, on a system with 4GB running a 32-bit version of python, you rarely start having trouble allocating arrays until you try to get a hold of ten times that.

    Run the following code on your computer:

    for j in itertools.count(100) :
        try :
            a = np.empty((j * 10**6,), dtype='uint8`)
            print 'Allocated {0} MB of memory!'.format(j)
            del a
        except MemoryError:
            print 'Failed to allocate {0} MB of memory!'.format(j)
            break
    

    And unless it fails to get you at least 4 times the amount calculated above, don't even hesitate about sticking the whole thing in memory using a scipy.sparse format.

    I have no experience with pytables, nor much with numpy's memmap arrays. But it seems to me that either one of those will involve you coding the logic to handle the sparsity, something I would try to avoid unless impossible to.