In the example below, I am creating a large NumPy array of zeros, putting random numbers on the diagonal, and then converting it to a SciPy sparse matrix. The memory usage I report comes from the Windows Task Manager.
>>> import sys, random
>>> import numpy as np
>>> from scipy import sparse
## Memory in use at this point: 3.1 GB
>>> m = np.zeros(shape = (40000, 40000), dtype = float)
>>> sys.getsizeof(m)
12800000112
## Memory in use at this point: 3.3 GB
>>> for i in range(40000):
...     m[i][i] = round(random.random(), 3)
>>> sys.getsizeof(m)
12800000112
## Memory in use at this point: 3.3 GB
>>> mSp = sparse.csr_matrix(m)
>>> sys.getsizeof(mSp)
56
## Memory in use at this point: 14.9 GB
>>> del m
## Memory in use at this point: 3.1 GB
My question is: why does the memory usage jump to almost 15 GB while the sparse matrix is being created, and only fall back to 3.1 GB once I delete the original NumPy object, which initially appeared to take up only about 200 MB of memory?
I suspect this has something to do with memory in use vs. memory committed, but I am struggling to understand the mechanism.
EDIT: I am running this on Windows 10
This has nothing to do with the sparse matrix, but with the way modern operating systems allocate memory: when you request memory, the OS immediately returns an address, but does not actually back it with pages of physical memory. Each individual page is only allocated the first time you touch it (read or write). Since you are only setting a few values, only a few pages get allocated, and all the pages that were never touched are simply not in physical memory.
This is typically reported as virtual (VIRT) memory and physical (PHYS) memory. Everything that is accounted for in VIRT but absent from PHYS is memory that has been allocated but not yet touched.
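You can watch this split from inside Python as well. Here is a minimal sketch, assuming the third-party psutil package is installed; vms is the virtual size and rss the resident (physical) size of the current process:
import numpy
import psutil

proc = psutil.Process()  # the current Python process

def report(label):
    info = proc.memory_info()
    print(f"{label}: VIRT={info.vms / 1e9:.1f} GB, PHYS={info.rss / 1e9:.1f} GB")

report("baseline")
array = numpy.zeros((10000, 50000))  # pages reserved, but not yet touched
report("after zeros")                # VIRT jumps, PHYS barely moves
array.sum()                          # reading touches every page
report("after sum")                  # now PHYS jumps as well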
You are seeing the spike in memory consumption because converting the matrix to a sparse.csr_matrix requires SciPy to read the entire array, which in turn forces your operating system to allocate all of its pages and fill them with zeros. Deleting m afterwards releases those pages again, which is why usage falls back once you run del m.
To understand this, we can walk through the following examples. Before importing anything, my IPython kernel sits at:
# 2GB VIRT, 44MB PHYS
We allocate memory filled with zeros; since the OS hands out zero-filled pages lazily, nothing has actually been touched yet. We use a lot of VIRT but almost no PHYS RAM.
import numpy
array = numpy.zeros((10000, 50000))
# 6GB VIRT, 50MB PHYS
After setting our diagonal with random values, we see only a slight increase in PHYS, because most of our pages still aren't physically backed by RAM.
# this is a more efficient way of setting the main diagonal of your array, by the way
array[numpy.arange(10000), numpy.arange(10000)] = numpy.random.rand(10000)
# 6GB VIRT, 90MB PHYS
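As an aside, NumPy also has a dedicated helper for this; the following writes the same values in place without spelling out the index arrays:
numpy.fill_diagonal(array, numpy.random.rand(10000))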
If we now calculate the sum of our array, the usage suddenly increases, because reading the memory triggers a physical allocation:
numpy.sum(array)
# 6GB VIRT, 4GB PHYS
And the same when creating an array filled with random values: all of it is immediately physically allocated.
array = numpy.random.rand(10000, 50000)
# 6GB VIRT, 4GB PHYS
This is why it is recommended to build sparse arrays directly in a sparse format:
import scipy.sparse
sparse_array = scipy.sparse.dok_matrix((10000, 50000))
# 2GB VIRT, 50MB PHYS
DOK allows indexing, so we can efficiently do
sparse_array[numpy.arange(10000), numpy.arange(10000)] = numpy.random.rand(10000)
# 2GB VIRT, 54MB PHYS
and allows an efficient conversion to CSR:
csr_sparse_array = scipy.sparse.csr_matrix(sparse_array)
# 2GB VIRT, 54MB PHYS
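A side note on the sys.getsizeof(mSp) == 56 from your question: getsizeof only measures the small Python wrapper object, not the data it points to. The actual footprint of a CSR matrix lives in its three internal NumPy arrays, which you can sum up yourself:
def csr_nbytes(m):
    # nonzero values, their column indices, and the row pointer array
    return m.data.nbytes + m.indices.nbytes + m.indptr.nbytes

csr_nbytes(m Sp if False else csr_sparse_array)  # a couple of hundred KB for the 10000 nonzeros here, versus ~4 GB dense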
These values were measured on OSX, but the general principle holds for Linux, OSX, and Windows.