Tags: python, matrix, sparse-matrix

How are sparse matrices affecting memory usage?


In the example below, I am creating a big numpy array of zeros, putting random numbers on the diagonal and then converting it to a scipy sparse matrix. My reporting of memory usage comes from the Task Manager.

>>> import sys, random
>>> import numpy as np
>>> from scipy import sparse
## Memory in use at this point: 3.1 GB
>>> m = np.zeros(shape = (40000, 40000), dtype = float)
>>> sys.getsizeof(m)
12800000112
## Memory in use at this point: 3.3 GB
>>> for i in range(40000):
...     m[i][i] = round(random.random(), 3)
>>> sys.getsizeof(m)
12800000112
## Memory in use at this point: 3.3 GB
>>> mSp = sparse.csr_matrix(m)
>>> sys.getsizeof(mSp)
56
## Memory in use at this point: 14.9 GB
>>> del m
## Memory in use at this point: 3.1 GB

My question is: why does the memory usage jump to nearly 15 GB during the creation of the sparse matrix, and only fall back to 3.1 GB when I delete the original numpy array, which initially appeared to take up only 200 MB of memory?

I suspect this has something to do with memory in use vs. memory committed, but I am struggling to understand the mechanism.

EDIT: I am running this on Windows 10


Solution

  • This has nothing to do with the sparse matrix, but with the way modern operating systems allocate memory: when you request memory, the OS immediately returns an address range, but does not actually allocate pages in physical memory. Each individual page is allocated only when the data in it is first touched (read or written). Since you are only setting a few values, only a few pages get allocated, and all the pages that were never touched are not actually in memory.

    This is typically shown as virtual (VIRT) memory and physical (PHYS) memory. Everything that is accounted for in VIRT but isn't there in PHYS is memory that has been allocated, but hasn't been touched yet.
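
    If you want to watch this yourself from inside Python rather than from the task manager or top, here is a minimal sketch using the third-party psutil package (the exact numbers will of course differ from machine to machine):

    import numpy, psutil

    def report(label):
        info = psutil.Process().memory_info()
        # vms is roughly VIRT, rss is roughly PHYS
        print(f"{label}: VIRT {info.vms / 1e9:.1f} GB, PHYS {info.rss / 1e9:.1f} GB")

    report("baseline")
    a = numpy.zeros((10000, 50000))   # ~4 GB requested, no page touched yet
    report("after zeros")             # VIRT grows by ~4 GB, PHYS barely moves
    a += 1                            # touching every element faults in all pages
    report("after touching")          # PHYS now grows by ~4 GB as well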

    You are seeing an increase in memory consumption because converting the matrix to a sparse.csr_matrix requires SciPy to read the entire dense array, which in turn makes your operating system allocate all of its pages and fill them with zeros.

    To understand this, we can look at the following examples. Before importing anything, my ipython kernel is at

    # 2GB VIRT, 44MB PHYS
    

    We allocate an array of zeros. The OS hands out pages that are guaranteed to already be zeroed, so numpy never has to write to them and nothing gets touched: we use a lot of VIRT but almost no PHYS RAM.

    import numpy
    array = numpy.zeros((10000, 50000))
    # 6GB VIRT, 50MB PHYS
    

    After setting our diagonal with random values, we see only a slight increase in PHYS, because most of our pages are not yet actually physically there in RAM.

    # this is a more efficient way of setting the main diagonal of your array, by the way
    array[numpy.arange(10000), numpy.arange(10000)] = numpy.random.rand(10000)
    # 6GB VIRT, 90MB PHYS
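
    For what it's worth, numpy.fill_diagonal does the same thing in one call, and likewise only writes to the pages that hold diagonal elements:

    numpy.fill_diagonal(array, numpy.random.rand(10000))
    # still only a slight increase in PHYS, for the same reason as above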
    

    If we now calculate the sum of our array, the usage suddenly increases, because reading the memory triggers the physical allocation:

    numpy.sum(array)
    # 6GB VIRT, 4GB PHYS
    

    The same happens when creating an array filled with random values: all of it is immediately physically allocated.

    array = numpy.random.rand(10000, 50000)
    # 6GB VIRT, 4GB PHYS
    

    This is the reason why creating sparse arrays directly in a sparse format is recommended:

    import scipy.sparse
    sparse_array = scipy.sparse.dok_matrix((10000, 50000))
    # 2GB VIRT, 50MB PHYS
    

    DOK allows indexing, so we can efficiently do

    sparse_array[numpy.arange(10000), numpy.arange(10000)] = numpy.random.rand(10000)
    # 2GB VIRT, 54MB PHYS
    

    and allows an efficient conversion to CSR:

    csr_sparse_array = scipy.sparse.csr_matrix(sparse_array)
    # 2GB VIRT, 54MB PHYS
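
    Note, by the way, that sys.getsizeof on a sparse matrix (the 56 bytes in the question) only reports the small Python wrapper object, not the matrix's actual payload. The payload of a CSR matrix lives in three numpy arrays, so its real footprint can be estimated like this:

    print(csr_sparse_array.data.nbytes
          + csr_sparse_array.indices.nbytes
          + csr_sparse_array.indptr.nbytes)
    # roughly 160 kB for the 10000 stored diagonal values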
    

    These values were measured on OSX, but the general principle holds for Linux, OSX and Windows.
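
    And since the matrix in the question is purely diagonal, you don't even need DOK as an intermediate step. A sketch using scipy.sparse.diags, which builds the sparse matrix straight from the diagonal values without ever materializing anything dense:

    import numpy, scipy.sparse

    diagonal_values = numpy.random.rand(40000)
    # the DIA format stores only the 40000 diagonal entries
    diagonal_matrix = scipy.sparse.diags(diagonal_values)
    # convert to CSR if that is the format you need downstream
    diag_csr = diagonal_matrix.tocsr()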