python numpy scipy sparse-matrix

Why is the scipy.sparse.csr_matrix not storing all the values being passed to it?


I am currently trying to store a large sparse dataset (4.9 million rows and 6000 columns) in CSR format. The dense format causes a memory error, so I am loading it line by line from a TSV file. Here is how I do that:

import numpy as np
from scipy.sparse import csr_matrix
rows=np.empty(4865518,dtype=np.int16)
cols=np.empty(165050535,dtype=np.int16)
values=np.empty(165050535,dtype=np.int16)
labels=np.empty(4865517,dtype=np.int8)
file=open(r'HI-union-allFeatures\HI-union-allFeatures-nonZero-train0.tsv','r')
count=0
nnz=0
col_count=0
for l in file:
    if count>0:
        l=l.strip().split("\t")
        line=l[2:-1]
        labels[count-1]=l[-1]
        for pair in line:
            pair=pair.split()
            cols[col_count]=pair[0]
            cols[col_count]-=3
            values[col_count]=pair[1]
            col_count+=1
        nnz+=len(line)
        rows[count]=nnz        
    count+=1
cols.astype(np.int16,copy=False) #cols gets stored as 32 bit for some reason.
cols.shape #(165050535,)
rows.shape #(4865518,)
values.shape #(165050535,)
data=csr_matrix((values, cols, rows),copy=False)
data.nnz #30887
data.data.shape #should match values.shape but output is (30887,)
data.indices.shape #should match cols.shape but output is (30887,)
data.indptr.shape #matches rows.shape (4865518,)

However, after creating the csr_matrix, it just eliminates some of the values, and I don't understand why. The screenshot shows that data.data.shape does not match values.shape. I also verified the data in the original rows, cols and values arrays, and they represent the data perfectly, so I don't understand this behaviour. My PC is not running out of memory: I have 16 GB of RAM and this program barely takes up 1 GB. EDIT: This is my first question here, so I'm sorry if I didn't post it correctly. Any help would be great.


Solution

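  • np.empty doesn't initialize arrays to zero. Your loop first writes rows[count] when count is 1, so rows[0] is never set and could be anything, whereas a CSR row pointer array must start at 0. From the NumPy documentation: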

    "empty, unlike zeros, does not set the array values to zero, and may therefore be marginally faster. On the other hand, it requires the user to manually set all the values in the array, and should be used with caution."

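  • int16 has a maximum value of 32767, but your row pointers go up to about 165 million (the running count of nonzeros). Values that large do not fit and wrap around modulo 65536: 165050535 mod 65536 = 30887, which matches the nnz of 30887 that the csr_matrix reports. The row pointer array needs at least int32.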

    Both of these things are huge problems. Without example data, providing a working fix as an answer is not possible.
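
For illustration only, here is a minimal sketch of both points on made-up data, not the asker's file. The first part shows the int16 wraparound with a plain integer cast. The second part builds a small CSR matrix with a zero-initialized row pointer array of a wide enough dtype. toy_rows, n_cols and the other names are hypothetical, and int64 for the row pointers is just one safe choice (int32 would also hold 165 million).

```python
import numpy as np
from scipy.sparse import csr_matrix

# Part 1: casting a large count to int16 wraps modulo 2**16.
total_nnz = np.array([165050535], dtype=np.int64)
wrapped = total_nnz.astype(np.int16)
print(wrapped[0])  # 30887, the nnz reported in the question

# Part 2: build a CSR matrix row by row from toy (column, value) pairs.
# toy_rows stands in for the parsed lines of the TSV file (assumption).
toy_rows = [
    [(0, 5), (3, 7)],          # row 0 has two nonzeros
    [],                        # row 1 is empty
    [(1, 2), (2, 4), (4, 9)],  # row 2 has three nonzeros
]
n_rows = len(toy_rows)
n_cols = 5
total = sum(len(pairs) for pairs in toy_rows)

# np.zeros guarantees indptr[0] == 0; int64 cannot overflow at this scale.
indptr = np.zeros(n_rows + 1, dtype=np.int64)
indices = np.empty(total, dtype=np.int32)  # every slot is written below
values = np.empty(total, dtype=np.int16)   # every slot is written below

pos = 0
for i, pairs in enumerate(toy_rows):
    for col, val in pairs:
        indices[pos] = col
        values[pos] = val
        pos += 1
    indptr[i + 1] = pos  # running nonzero count after row i

data = csr_matrix((values, indices, indptr), shape=(n_rows, n_cols))
print(data.nnz)  # 5
print(data.toarray())
```

Passing shape explicitly avoids relying on the largest column index seen to infer the number of columns. Using np.empty for indices and values is fine here only because the loop fills every element.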