Tags: python, pandas, numpy, scipy, sparse-matrix

How can I "sparsify" on two values?


Consider the pandas series s:

import pandas as pd

n = 1000
s = pd.Series([0] * n + [1] * n, dtype=int)

s.memory_usage()

8080

I can "sparsify" this by using to_sparse

s.to_sparse(fill_value=0).memory_usage()

4080

But I only have 2 distinct integer values. I'd think I could sparsify twice. Is there a way to do this?


Solution

  • Since you tagged this with scipy, I'll show you what a scipy.sparse matrix is like:

    In [31]: n=100
    In [32]: arr=np.array([[0]*n+[1]*n],int)
    In [33]: M=sparse.csr_matrix(arr)
    In [34]: M.data
    Out[34]: 
    array([1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
           1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
           1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
           1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
           1, 1, 1, 1, 1, 1, 1, 1], dtype=int32)
    In [35]: M.indices
    Out[35]: 
    array([100, 101, 102, 103, 104, 105, 106, 107, 108, 109, 110, 111, 112,
           113, 114, 115, 116, 117, 118, 119, 120, 121, 122, 123, 124, 125,
           126, 127, 128, 129, 130, 131, 132, 133, 134, 135, 136, 137, 138,
           139, 140, 141, 142, 143, 144, 145, 146, 147, 148, 149, 150, 151,
           152, 153, 154, 155, 156, 157, 158, 159, 160, 161, 162, 163, 164,
           165, 166, 167, 168, 169, 170, 171, 172, 173, 174, 175, 176, 177,
           178, 179, 180, 181, 182, 183, 184, 185, 186, 187, 188, 189, 190,
           191, 192, 193, 194, 195, 196, 197, 198, 199], dtype=int32)
    In [36]: M.indptr
    Out[36]: array([  0, 100], dtype=int32)
    

    It has replaced the 2n elements of arr with two arrays of n elements each (data and indices). Even if I change the data dtype from int to uint8, the M.indices array will still be int32.
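
    To make that concrete, here's a quick check of the byte counts for that session (a sketch, using the same arrays as above):

    import numpy as np
    from scipy import sparse

    n = 100
    arr = np.array([[0]*n + [1]*n], int)
    M = sparse.csr_matrix(arr)

    # data holds the n nonzero values; indices holds their column positions
    print(M.data.nbytes, M.indices.nbytes, M.indptr.nbytes)

    # a uint8 dtype shrinks M.data 4x, but M.indices stays int32
    M8 = sparse.csr_matrix(arr.astype(np.uint8))
    print(M8.data.nbytes, M8.indices.nbytes)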

    The fact that your pandas version has half the memory usage suggests that it is just storing the indices, and somehow noting that the data part is all 1s. But that's just a guess.
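
    The byte counts line up with that guess (assuming dtype=int landed as 4-byte int32 on your platform, plus an 80-byte index overhead):

    # dense:  2000 int32 values * 4 bytes + 80-byte index = 8080
    # sparse: 1000 stored int32 items * 4 bytes + 80      = 4080
    print(2000 * 4 + 80, 1000 * 4 + 80)   # 8080 4080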

    How much greater sparsification do you expect?

    ====================

    http://pandas.pydata.org/pandas-docs/stable/sparse.html

    This example looks like pandas is implementing some sort of run-length ('block') compression:

    In [4]: sts
    Out[4]: 
    0    0.469112
    1   -0.282863
    2         NaN
    3         NaN
    4         NaN
    5         NaN
    6         NaN
    7         NaN
    8   -0.861849
    9   -2.104569
    dtype: float64
    BlockIndex
    Block locations: array([0, 8], dtype=int32)
    Block lengths: array([2, 2], dtype=int32)
    

    It has identified 2 blocks, of length 2 each. It still has to store the 4 nonfill values in some array.
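
    To illustrate what that block bookkeeping encodes, here is a rough sketch of the decompression step; the helper name is mine, not a pandas internal:

    import numpy as np

    def blocks_to_dense(locs, lengths, values, size, fill=np.nan):
        # rebuild the dense array from block locations/lengths plus stored values
        out = np.full(size, fill)
        start = 0
        for loc, length in zip(locs, lengths):
            out[loc:loc + length] = values[start:start + length]
            start += length
        return out

    # the BlockIndex above: blocks at 0 and 8, each of length 2
    vals = np.array([0.469112, -0.282863, -0.861849, -2.104569])
    blocks_to_dense([0, 8], [2, 2], vals, 10)   # reproduces sts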

    A csr sparse equivalent (for a row array):

    In [1052]: arr=np.random.rand(10)
    In [1053]: arr[2:-2]=0
    In [1055]: M=sparse.csr_matrix(arr)
    In [1056]: M
    Out[1056]: 
    <1x10 sparse matrix of type '<class 'numpy.float64'>'
        with 4 stored elements in Compressed Sparse Row format>
    In [1057]: M.data
    Out[1057]: array([ 0.37875012,  0.73703368,  0.7935645 ,  0.22948213])
    In [1058]: M.indices
    Out[1058]: array([0, 1, 8, 9], dtype=int32)
    In [1059]: M.indptr
    Out[1059]: array([0, 4], dtype=int32)
    

    The pandas version might be more compact if the fill values occur in blocks. But I suspect

    0         1.0
    1         1.0
    2         NaN
    3         NaN
    4         NaN
    5         NaN
    6         NaN
    7         NaN
    8         1.0
    9         1.0
    

    would produce the same blocks. I don't see any evidence that it tries to identify the identical 1.0 values and store those as a value plus a count.
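
    That's easy to test with the same (old, pre-1.0) pandas sparse API used above; I'd expect something like:

    import numpy as np
    import pandas as pd

    s = pd.Series([1.0, 1.0] + [np.nan] * 6 + [1.0, 1.0])
    sts = s.to_sparse()      # old API; removed in pandas 1.0
    print(sts.sp_index)      # expect: Block locations [0, 8], lengths [2, 2]
    print(sts.sp_values)     # four separate 1.0 values, no value-plus-count pairing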

    ================

    Based on @MaxU's answer, your sparse series stores the 1000 1s, plus two single-element arrays that tell it where those values are stored.

    In [56]: sp.memory_usage()
    Out[56]: 1080
    
    In [57]: sp.sp_index
    Out[57]:
    BlockIndex
    Block locations: array([1000])
    Block lengths: array([1000])
    

    As long as the nonfill values occur in big runs, the block arrays will be small. But if you scattered those 1000 values throughout the series, you'd multiply the number of blocks substantially:

    Block locations: array([1,3,6,10,...])
    Block lengths: array([1,1,1,2,1,...])
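
    For example, alternating values give one block per run (again with the old to_sparse API):

    import pandas as pd

    s = pd.Series([0, 1] * 1000, dtype=int)   # 1000 scattered 1s
    sp = s.to_sparse(fill_value=0)
    print(sp.sp_index)        # expect ~1000 blocks, each of length 1
    print(sp.memory_usage())  # block overhead now rivals the data itself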
    

    I can imagine mapping between the csr layout and the pandas blocks, but haven't worked out the details. The csr layout is meant for 2d arrays, with a clear concept of rows and columns. It looks like a sparse dataframe just contains a collection of sparse series objects.
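
    One direction of that mapping is easy to sketch: expand the block (location, length) pairs into flat positions, which is what csr's indices hold. The helper below is hypothetical, not pandas or scipy code:

    import numpy as np

    def block_index_to_positions(locs, lengths):
        # expand each (location, length) pair into a run of flat positions
        return np.concatenate([np.arange(loc, loc + n)
                               for loc, n in zip(locs, lengths)])

    block_index_to_positions([0, 8], [2, 2])
    # array([0, 1, 8, 9]) -- matches M.indices in the csr example above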

    ===================

    https://stackoverflow.com/a/38157234/901925 shows how to map from sparse dataframe values to a scipy sparse matrix. For each column (a sparse series) it uses sp_values, fill_value, and sp_index.

    pandas/pandas/sparse/scipy_sparse.py has the code for the interaction between scipy sparse matrices and sparse series.
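
    A hand-rolled version of that column-by-column mapping might look like this. It's a sketch against the old SparseDataFrame/SparseSeries API; it assumes a zero (or NaN) fill_value, and that sp_index.to_int_index() is available to flatten a BlockIndex:

    import numpy as np
    from scipy import sparse

    def sdf_to_coo(sdf):
        # build a coo matrix from an (old-style) SparseDataFrame, column by column
        rows, cols, data = [], [], []
        for j, col in enumerate(sdf.columns):
            ss = sdf[col]
            idx = ss.sp_index.to_int_index().indices   # flat row positions
            rows.append(idx)
            cols.append(np.full(len(idx), j, dtype=idx.dtype))
            data.append(ss.sp_values)
        return sparse.coo_matrix(
            (np.concatenate(data), (np.concatenate(rows), np.concatenate(cols))),
            shape=sdf.shape)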

    ===================

    `kind='integer'` produces a sparse structure more like `scipy.sparse`:

    In [62]: n=5; s=pd.Series([0]*n+[1]*n, dtype=int)
    In [63]: ss=s.to_sparse(fill_value=0, kind='integer')
    In [64]: ss
    Out[64]: 
    0    0
    1    0
    2    0
    3    0
    4    0
    5    1
    6    1
    7    1
    8    1
    9    1
    dtype: int32
    IntIndex
    Indices: array([5, 6, 7, 8, 9])
    

    Contrast that with the default `block` kind:

    dtype: int32
    BlockIndex
    Block locations: array([5])
    Block lengths: array([5])
    

    An equivalent column sparse matrix can be built with:

    In [90]: data=ss.sp_values
    In [91]: rows=ss.sp_index.indices
    In [92]: cols=np.zeros_like(rows)
    In [93]: sparse.csr_matrix((data,(rows,cols)))
    Out[93]: 
    <10x1 sparse matrix of type '<class 'numpy.int32'>'
        with 5 stored elements in Compressed Sparse Row format>
    

    There is a `to_coo` method, but it only works with the more complex `pd.MultiIndex` object (why?).
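
    For the record, here's roughly how `to_coo` is used, adapted from the pandas sparse docs linked above. The MultiIndex levels supply the row and column coordinates; presumably that's why it's required, since a flat series has no second dimension:

    import numpy as np
    import pandas as pd

    s = pd.Series([3.0, np.nan, 1.0, 3.0, np.nan, np.nan])
    s.index = pd.MultiIndex.from_tuples(
        [(1, 2, 'a', 0), (1, 2, 'a', 1), (1, 1, 'b', 0),
         (1, 1, 'b', 1), (2, 1, 'b', 0), (2, 1, 'b', 1)],
        names=['A', 'B', 'C', 'D'])
    ss = s.to_sparse()       # old API; removed in pandas 1.0
    # rows come from levels A,B; columns from levels C,D
    A, rows, columns = ss.to_coo(row_levels=['A', 'B'],
                                 column_levels=['C', 'D'],
                                 sort_labels=False)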