python, pandas, pickle, sparse-matrix

No space benefit using sparse Pandas dataframe despite extremely low density


I am using Python/Pandas to deal with very large and very sparse single-column data frames, but when I pickle them, there is virtually no size benefit. If I try the same thing in Matlab, the difference is colossal, so I am trying to understand what is going on.

Using Pandas:

import pickle
import numpy as np
import pandas as pd

len(SecondBins)
>> 34300801

dense = pd.DataFrame(np.zeros(len(SecondBins)), columns=['Binary'], index=SecondBins)
sparse = pd.DataFrame(np.zeros(len(SecondBins)), columns=['Binary'], index=SecondBins).to_sparse(fill_value=0)

pickle.dump(dense, open('dense.p', 'wb'))
pickle.dump(sparse, open('sparse.p', 'wb'))

Looking at the sizes of the pickled files: dense = 548.8 MB, sparse = 274.4 MB.

However, when I look at memory usage associated with these variables,

dense.memory_usage()
>>Binary    274406408
>>dtype: int64

sparse.memory_usage()
>>Binary    0
>>dtype: int64
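(Note that, depending on the pandas version, `memory_usage()` may not count the index unless you pass `index=True`; a small sketch with made-up data shows that a timestamp index costs 8 bytes per row, as much as the data column itself:)

```python
import numpy as np
import pandas as pd

# Hypothetical small stand-in for SecondBins: one timestamp per second
idx = pd.date_range('2015-01-01', periods=1000, freq='s')
df = pd.DataFrame({'Binary': np.zeros(1000)}, index=idx)

# index=True adds an 'Index' row to the per-column report
usage = df.memory_usage(index=True)
print(usage)
```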

So, for a completely empty sparse vector, there is slightly more than a 50% saving. Perhaps it has something to do with the fact that the variable 'SecondBins' is composed of pd.Timestamp objects, which I use as the index, so I tried a similar procedure using the default index.

dense_defaultindex = pd.DataFrame(np.zeros(len(SecondBins)), columns=['Binary'])
sparse_defaultindex = pd.DataFrame(np.zeros(len(SecondBins)), columns=['Binary']).to_sparse(fill_value=0)

pickle.dump(dense_defaultindex, open('dense_defaultindex.p', 'wb'))
pickle.dump(sparse_defaultindex, open('sparse_defaultindex.p', 'wb'))

But it yields the same file sizes on disk.

What is pickle doing under the hood? If I create a similar zero-filled vector in Matlab, and save it in a .mat file, it's ~180 bytes!?

Please advise.

Regards


Solution

  • Remember that pandas data is labeled. The column labels and the index labels are essentially specialized arrays, and those arrays take up space. So in practice the index acts as an additional column as far as space usage goes, and the column headings act as an additional row.

    In the dense case, you essentially have two columns: the data and the index. In the sparse case, you essentially have one column: the index (since the sparse data column contains almost no data). From this perspective, you would expect the sparse case to be about half the size of the dense case, and that is exactly what you see in your file sizes.
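    You can confirm this with `pickle.dumps` on smaller, made-up data (a sketch; since `to_sparse` was removed in pandas 1.0, this uses `pd.arrays.SparseArray`, and the size of SecondBins is scaled down to one million rows):

    ```python
    import pickle
    import numpy as np
    import pandas as pd

    n = 1_000_000
    idx = pd.date_range('2015-01-01', periods=n, freq='s')  # hypothetical timestamp index

    dense = pd.DataFrame({'Binary': np.zeros(n)}, index=idx)
    # SparseArray with fill_value=0 stores no actual values for an all-zero column
    sparse = pd.DataFrame({'Binary': pd.arrays.SparseArray(np.zeros(n), fill_value=0)},
                          index=idx)

    dense_bytes = len(pickle.dumps(dense))
    sparse_bytes = len(pickle.dumps(sparse))
    index_bytes = len(pickle.dumps(idx))

    # The sparse frame pickles to roughly the size of its index alone,
    # about half the dense frame
    print(dense_bytes, sparse_bytes, index_bytes)
    ```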

    In the MATLAB case, however, the data is not labeled, so the sparse case takes up almost no space. The equivalent of the MATLAB case would be a sparse matrix, not a sparse dataframe. So if you want to take full advantage of the space savings, you should use scipy.sparse, which provides sparse matrix support similar to what you get in MATLAB.
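    For example, an all-zero sparse row vector of the same length pickles down to a few hundred bytes, comparable to the .mat file (a sketch, assuming scipy is installed):

    ```python
    import pickle
    import numpy as np
    from scipy import sparse

    n = 34_300_801
    # A 1 x n all-zero CSR matrix stores no explicit entries at all
    vec = sparse.csr_matrix((1, n), dtype=np.float64)

    data = pickle.dumps(vec)
    print(len(data))  # hundreds of bytes, not hundreds of megabytes
    ```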