python, pandas, scipy

How to generate a COO matrix from two dataframes in sparse format


I have two dataframes in sparse format with slightly different indices and columns. I need a COO version of the concatenation of both dataframes. When I generate a COO matrix from the result, I get zeros in the data attribute, which I did not expect, even though every column has dtype pd.SparseDtype("float", 0). It seems that fillna(0) introduces zeros that creep into the stored data of the sparse format, which shouldn't contain zeros.

Here is some code to reproduce the behaviour:

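    import numpy as np
    import pandas as pd

    # 3x3 identity, stored sparsely with fill value 0
    A = pd.DataFrame(np.eye(3),
                     index=['a','b','c'],
                     columns=['a','b','c'],
                     ).astype(pd.SparseDtype("float",0))

    # 2x2 random block with new row labels and no column 'c'
    B = pd.DataFrame(np.random.normal(size=(2,2)),
                     index=['d','e'],
                     columns=['a','b'],
                     ).astype(pd.SparseDtype("float",0))

    # concat leaves NaN where B has no column 'c'; fillna(0) replaces it
    c = pd.concat([A,B],axis=0).fillna(0)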

In that example, both c.sparse.to_coo().data and c['c'].sparse.sp_values contain zeros, which defeats the purpose of using a sparse data format. I don't understand what is going on. How can I concatenate dataframes in sparse format without these zeros taking up space? I am using pandas version 2.2.2.


Solution

  • Your dataframes and their COO versions:

    In [253]: A
    Out[253]: 
         a    b    c
    a  1.0  0.0  0.0
    b  0.0  1.0  0.0
    c  0.0  0.0  1.0
    
    In [254]: print(A.sparse.to_coo())
      (0, 0)    1.0
      (1, 1)    1.0
      (2, 2)    1.0
    
    In [255]: B
    Out[255]: 
              a         b
    d  1.300532  0.507797
    e  2.716205  0.350493
    
    In [256]: print(B.sparse.to_coo())
      (0, 0)    1.3005317919570971
      (1, 0)    2.716205281906014
      (0, 1)    0.5077973046544041
      (1, 1)    0.3504927594184647
    

    And the concatenated version:

    In [257]: c
    Out[257]: 
              a         b    c
    a  1.000000  0.000000  0.0
    b  0.000000  1.000000  0.0
    c  0.000000  0.000000  1.0
    d  1.300532  0.507797  0.0
    e  2.716205  0.350493  0.0
    
    In [258]: print(c.sparse.to_coo())
      (0, 0)    1.0
      (3, 0)    1.3005317919570971
      (4, 0)    2.716205281906014
      (1, 1)    1.0
      (3, 1)    0.5077973046544041
      (4, 1)    0.3504927594184647
      (2, 2)    1.0
      (3, 2)    0.0
      (4, 2)    0.0
    

    So the explicit zeros at (3,2) and (4,2) come from the fillna: they are the entries of column c that B does not have. That would be more obvious if you used a different fill value.

    A SciPy sparse matrix does have a method for removing explicit zeros, eliminate_zeros(), but it isn't applied automatically. And I don't know the pandas sparse code well enough to pass the result back to a dataframe.

    In [259]: M=c.sparse.to_coo()
    In [260]: M.eliminate_zeros(); print(M)
      (0, 0)    1.0
      (3, 0)    1.3005317919570971
      (4, 0)    2.716205281906014
      (1, 1)    1.0
      (3, 1)    0.5077973046544041
      (4, 1)    0.3504927594184647
      (2, 2)    1.0
    
    In [261]: M.toarray()
    Out[261]: 
    array([[1.        , 0.        , 0.        ],
           [0.        , 1.        , 0.        ],
           [0.        , 0.        , 1.        ],
           [1.30053179, 0.5077973 , 0.        ],
           [2.71620528, 0.35049276, 0.        ]])
    

    Using a different fillna value:

    In [262]: pd.concat([A,B],axis=0).fillna(np.nan)
    Out[262]: 
              a         b    c
    a  1.000000  0.000000  0.0
    b  0.000000  1.000000  0.0
    c  0.000000  0.000000  1.0
    d  1.300532  0.507797  NaN
    e  2.716205  0.350493  NaN
    

    Pandas sparse code has been somewhat experimental, so it's not surprising that steps like concat miss details like this. Evidently there isn't any code that checks whether the fillna value is the same as the sparse fill value and does this sort of cleanup.
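
One possible way to pass the cleaned matrix back to a dataframe is sketched below. It assumes a round trip through SciPy is acceptable and uses `DataFrame.sparse.from_spmatrix` to rebuild the frame. The helper name `drop_explicit_zeros` and the fixed values in B are made up for illustration; they are not from the question.

```python
import numpy as np
import pandas as pd


def drop_explicit_zeros(df):
    """Return a sparse copy of `df` in which no zero is stored explicitly.

    Hypothetical helper: assumes every column of `df` is sparse with fill value 0.
    """
    m = df.sparse.to_coo()   # SciPy COO matrix; may still hold explicit zeros
    m.eliminate_zeros()      # in-place removal of the stored zeros
    # Rebuild a sparse dataframe, restoring the original row and column labels.
    return pd.DataFrame.sparse.from_spmatrix(m, index=df.index, columns=df.columns)


sparse_float = pd.SparseDtype("float", 0)
A = pd.DataFrame(np.eye(3), index=list("abc"), columns=list("abc")).astype(sparse_float)
# Fixed nonzero values instead of random ones, so the example is reproducible.
B = pd.DataFrame([[1.5, 0.5], [2.5, 0.25]],
                 index=["d", "e"], columns=["a", "b"]).astype(sparse_float)

c = pd.concat([A, B], axis=0).fillna(0)
clean = drop_explicit_zeros(c)
print(clean.sparse.to_coo())
```

After this, `clean.sparse.to_coo().data` should hold only the nonzero entries. If only the COO matrix is needed, the `from_spmatrix` step can be skipped and the matrix returned by `eliminate_zeros()` used directly.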