python numpy matrix scipy sparse-matrix

Creating a large sparse matrix in scipy.sparse


I am using scipy.sparse in my application and want to do some performance tests. In order to do that, I need to create a large sparse matrix (which I will then use in my application). As long as the matrix is small, I can create it using the command

import scipy.sparse as sp
a = sp.rand(1000,1000,0.01)

This results in a 1000 by 1000 matrix with 10,000 nonzero entries (a reasonable density, meaning approximately 10 nonzero entries per row).

The problem arises when I try to create a larger matrix, for example a 100,000 by 100,000 matrix (I have dealt with much larger matrices before). I run

import scipy.sparse as sp
N = 100000
d = 0.0001
a = sp.rand(N, N, d)

which should result in a 100,000 by 100,000 matrix with one million nonzero entries (well within the realm of possibility). Instead, I get an error message:

Traceback (most recent call last):
  File "<pyshell#6>", line 1, in <module>
    sp.rand(100000,100000,0.0000001)
  File "C:\Python27\lib\site-packages\scipy\sparse\construct.py", line 723, in rand
    j = random_state.randint(mn)
  File "mtrand.pyx", line 935, in mtrand.RandomState.randint (numpy\random\mtrand\mtrand.c:10327)
OverflowError: Python int too large to convert to C long

This is some annoying internal scipy error that I cannot get rid of.


I understand that I can create a 10*n by 10*n matrix by creating one hundred n by n matrices and then stacking them together. However, I think that scipy.sparse should be able to handle the creation of large sparse matrices directly (I say again, 100k by 100k is by no means large, and scipy is more than comfortable handling matrices with several million rows). Am I missing something?
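The stacking workaround mentioned above could be sketched roughly as follows. This is only an illustration: the helper name `rand_by_blocks` and its parameters are made up for this example, and it assumes each block is small enough for `sp.rand` to handle. The demo call uses small sizes so that it runs quickly.

```python
import scipy.sparse as sp


def rand_by_blocks(n_block, blocks_per_side, density, fmt="csr"):
    """Hypothetical helper: build a (k*n) by (k*n) random sparse matrix
    from k*k independent n by n random blocks (k = blocks_per_side)."""
    grid = [
        [sp.rand(n_block, n_block, density) for _ in range(blocks_per_side)]
        for _ in range(blocks_per_side)
    ]
    # sp.bmat assembles a sparse matrix from a 2-D grid of sparse blocks.
    return sp.bmat(grid, format=fmt)


# One hundred 1000 by 1000 blocks give a 10000 by 10000 matrix.
a = rand_by_blocks(1000, 10, 0.0001)
print(a.shape)
```

Each block is generated independently, so the overall density matches the per-block density, but the product of dimensions seen by any single `sp.rand` call stays at n*n.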


Solution

  • Without getting to the bottom of the issue, you should make sure that you are using a 64 bit build on a 64 bit architecture, on a Linux platform. There, the native "long" data type is 64 bits wide (as opposed to Windows, I believe).

    For reference, see these tables:

    Edit: Maybe I was not explicit enough before: on 64 bit Windows, the classical native "long" data type is 32 bits wide (also see this question). This might be the problem in your case. That is, your code might just work when you change platform to Linux. I cannot say this with absolute certainty, because it really depends on which native data types are used in the numpy/scipy C source. Of course there are 64 bit data types available on Windows, and usually a platform case analysis is performed with compiler directives and the proper types are chosen via macros, so I cannot really imagine that they've used 32 bit data types by accident.

    Edit 2:

    I can provide three data samples supporting my hypothesis.

    Debian 64 bit, Python 2.7.3 and SciPy 0.10.1 binaries from Debian repos:

    Python 2.7.3 (default, Mar 13 2014, 11:03:55)
    [GCC 4.7.2] on linux2
    Type "help", "copyright", "credits" or "license" for more information.
    >>> import scipy; print scipy.__version__; import scipy.sparse as s; s.rand(100000, 100000, 0.0001).shape
    0.10.1
    (100000, 100000)
    

    Windows 7 64 bit, 32 bit Python build, 32 bit SciPy 0.10.1 build, both from ActivePython:

    ActivePython 2.7.5.6 (ActiveState Software Inc.) based on
    Python 2.7.5 (default, Sep 16 2013, 23:16:52) [MSC v.1500 32 bit (Intel)] on win32
    Type "help", "copyright", "credits" or "license" for more information.
    >>> import scipy; print scipy.__version__; import scipy.sparse as s; s.rand(100000, 100000, 0.0001).shape
    0.10.1
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
      File "C:\Users\user\AppData\Roaming\Python\Python27\site-packages\scipy\sparse\construct.py", line 426, in rand
        raise ValueError(msg % np.iinfo(tp).max)
    ValueError: Trying to generate a random sparse matrix such as the product of dimensions is
    greater than 2147483647 - this is not supported on this machine
    

    Windows 7 64 bit, 64 bit ActivePython build, 64 bit SciPy 0.15.1 build (from Gohlke, built against MKL):

    ActivePython 3.4.1.0 (ActiveState Software Inc.) based on
    Python 3.4.1 (default, Aug  7 2014, 13:09:27) [MSC v.1600 64 bit (AMD64)] on win32
    Type "help", "copyright", "credits" or "license" for more information.
    >>> import scipy; scipy.__version__; import scipy.sparse as s; s.rand(100000, 100000, 0.0001).shape
    '0.15.1'
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
      File "C:\Python34\lib\site-packages\scipy\sparse\construct.py", line 723, in rand
        j = random_state.randint(mn)
      File "mtrand.pyx", line 935, in mtrand.RandomState.randint (numpy\random\mtrand\mtrand.c:10327)
    OverflowError: Python int too large to convert to C long
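
To check the hypothesis on a given machine, and to sidestep the problem if it applies, something along these lines might help. It is a sketch under assumptions, not how `sp.rand` works: judging from the traceback, `sp.rand` draws from the flat range N*N, whereas the made-up helper `rand_coo` below draws row and column indices separately, so no number larger than N is ever requested. Duplicate positions can occur and are summed, so the number of nonzeros may come out slightly below the target.

```python
import ctypes

import numpy as np
import scipy.sparse as sp

N = 100000
d = 0.0001

# Width of the native C long for this interpreter build: 32 bits on Windows
# (even on 64 bit builds), 64 bits on 64 bit Linux builds.
c_long_bits = ctypes.sizeof(ctypes.c_long) * 8
c_long_max = 2 ** (c_long_bits - 1) - 1
print("C long: %d bits, N*N fits: %s" % (c_long_bits, N * N <= c_long_max))


def rand_coo(n_rows, n_cols, density, seed=None):
    """Hypothetical helper: random sparse matrix built without a flat index.

    Row and column indices are drawn independently, so the random number
    generator is never asked for a value as large as n_rows * n_cols.
    """
    rng = np.random.RandomState(seed)
    nnz = int(round(density * n_rows * n_cols))
    rows = rng.randint(0, n_rows, size=nnz)
    cols = rng.randint(0, n_cols, size=nnz)
    vals = rng.rand(nnz)
    # Converting COO to CSR sums entries that share the same (row, col).
    return sp.coo_matrix((vals, (rows, cols)), shape=(n_rows, n_cols)).tocsr()


a = rand_coo(N, N, d, seed=0)
print(a.shape)
```

If the first line of output reports a 32 bit C long, then N*N = 10^10 cannot be represented in it, which is consistent with the OverflowError shown above.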