I have two DataFrames in sparse format with slightly different indices and columns, and I need a COO version of their concatenation. When I generate a COO matrix from the concatenated DataFrame, I get zeros in the data attribute, which I did not expect, even though all the columns have a pd.SparseDtype("float", 0) data type. It seems the fillna(0) method introduces zeros that creep into the stored data of a sparse format, which should not contain zeros.
Here is some code to reproduce the behaviour:
import numpy as np
import pandas as pd
A = pd.DataFrame(np.eye(3), index=['a','b','c'],
                 columns=['a','b','c']).astype(pd.SparseDtype("float", 0))
B = pd.DataFrame(np.random.normal(size=(2,2)),
                 index=['d','e'],
                 columns=['a','b'],
                 ).astype(pd.SparseDtype("float", 0))
# B has no column 'c', so concat leaves missing values there; fillna(0) replaces them
c = pd.concat([A, B], axis=0).fillna(0)
In that example, both c.sparse.to_coo().data and simply c['c'].sparse.sp_values contain zeros, which defeats the purpose of using a sparse data format. I don't understand what is going on. How can I concatenate DataFrames in sparse format without these zeros taking up space? I am using pandas version 2.2.2.
Your dataframes and coo versions:
In [253]: A
Out[253]:
a b c
a 1.0 0.0 0.0
b 0.0 1.0 0.0
c 0.0 0.0 1.0
In [254]: print(A.sparse.to_coo())
(0, 0) 1.0
(1, 1) 1.0
(2, 2) 1.0
In [255]: B
Out[255]:
a b
d 1.300532 0.507797
e 2.716205 0.350493
In [256]: print(B.sparse.to_coo())
(0, 0) 1.3005317919570971
(1, 0) 2.716205281906014
(0, 1) 0.5077973046544041
(1, 1) 0.3504927594184647
And the concatenated version:
In [257]: c
Out[257]:
a b c
a 1.000000 0.000000 0.0
b 0.000000 1.000000 0.0
c 0.000000 0.000000 1.0
d 1.300532 0.507797 0.0
e 2.716205 0.350493 0.0
In [258]: print(c.sparse.to_coo())
(0, 0) 1.0
(3, 0) 1.3005317919570971
(4, 0) 2.716205281906014
(1, 1) 1.0
(3, 1) 0.5077973046544041
(4, 1) 0.3504927594184647
(2, 2) 1.0
(3, 2) 0.0
(4, 2) 0.0
So the (3, 2) and (4, 2) entries come from the fillna. That would be more obvious if you used a different fill value.
A SciPy sparse matrix does have a method, eliminate_zeros, for removing explicitly stored zeros, but it isn't applied automatically. And I don't know the pandas sparse code well enough to pass this back to a dataframe.
In [259]: M=c.sparse.to_coo()
In [260]: M.eliminate_zeros(); print(M)
(0, 0) 1.0
(3, 0) 1.3005317919570971
(4, 0) 2.716205281906014
(1, 1) 1.0
(3, 1) 0.5077973046544041
(4, 1) 0.3504927594184647
(2, 2) 1.0
In [261]: M.toarray()
Out[261]:
array([[1. , 0. , 0. ],
[0. , 1. , 0. ],
[0. , 0. , 1. ],
[1.30053179, 0.5077973 , 0. ],
[2.71620528, 0.35049276, 0. ]])
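One possible way to get the cleaned matrix back into pandas is DataFrame.sparse.from_spmatrix, which accepts a SciPy sparse matrix plus index and columns. The following is a minimal sketch rather than a tested recipe for every pandas version; the seeded generator and the name `clean` are only there for illustration, and SciPy must be installed for to_coo to work.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)  # seeded only so the sketch is repeatable
A = pd.DataFrame(np.eye(3), index=list("abc"),
                 columns=list("abc")).astype(pd.SparseDtype("float", 0))
B = pd.DataFrame(rng.normal(size=(2, 2)), index=["d", "e"],
                 columns=["a", "b"]).astype(pd.SparseDtype("float", 0))
c = pd.concat([A, B], axis=0).fillna(0)

# Convert to a SciPy COO matrix and drop the explicitly stored zeros in place
M = c.sparse.to_coo()
M.eliminate_zeros()

# Rebuild a sparse DataFrame, reusing the original row and column labels
clean = pd.DataFrame.sparse.from_spmatrix(M, index=c.index, columns=c.columns)

print(clean.sparse.to_coo().nnz)  # number of stored entries after cleanup
print(clean.dtypes)
```

The round trip goes through SciPy, so it only suits frames where every column is sparse and numeric.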
Using a different fillna value:
In [262]: pd.concat([A,B],axis=0).fillna(np.nan)
Out[262]:
a b c
a 1.000000 0.000000 0.0
b 0.000000 1.000000 0.0
c 0.000000 0.000000 1.0
d 1.300532 0.507797 NaN
e 2.716205 0.350493 NaN
Pandas sparse code has been somewhat experimental, so it's not surprising that steps like concat miss details like this. Evidently there isn't any code to check whether the fillna value is the same as the sparse fill value and do this sort of cleanup.
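Such a check and cleanup can be approximated by hand. The sketch below uses two hypothetical helpers (`count_stored_fill` and `resparsify` are names made up here, not pandas functions) that count stored entries equal to each column's fill value and then rebuild every column as a fresh SparseArray. It assumes all columns are sparse with a non-NaN fill value, and it densifies one column at a time, so it may not suit very long columns.

```python
import numpy as np
import pandas as pd


def count_stored_fill(df):
    """Per column, count explicitly stored entries equal to the fill value."""
    return {
        col: int(np.sum(df[col].sparse.sp_values == df[col].sparse.fill_value))
        for col in df.columns
    }


def resparsify(df):
    """Rebuild each column so entries equal to its fill value are not stored.

    Assumption: every column has a sparse dtype with a non-NaN fill value.
    Each column is converted to a dense array temporarily.
    """
    return pd.DataFrame(
        {
            col: pd.arrays.SparseArray(
                np.asarray(df[col]), fill_value=df[col].sparse.fill_value
            )
            for col in df.columns
        },
        index=df.index,
    )


rng = np.random.default_rng(0)  # seeded only so the sketch is repeatable
A = pd.DataFrame(np.eye(3), index=list("abc"),
                 columns=list("abc")).astype(pd.SparseDtype("float", 0))
B = pd.DataFrame(rng.normal(size=(2, 2)), index=["d", "e"],
                 columns=["a", "b"]).astype(pd.SparseDtype("float", 0))
c = pd.concat([A, B], axis=0).fillna(0)

print(count_stored_fill(c))      # stored zeros per column before cleanup
fixed = resparsify(c)
print(count_stored_fill(fixed))  # stored zeros per column after cleanup
```

Unlike a round trip through SciPy, this stays within pandas and keeps each column's own fill value.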