Current pandas version: 0.22
I have a SparseDataFrame.
A = pd.SparseDataFrame(
[['a',0,0,'b'],
[0,0,0,'c'],
[0,0,0,0],
[0,0,0,'a']])
A
0 1 2 3
0 a 0 0 b
1 0 0 0 c
2 0 0 0 0
3 0 0 0 a
Right now, the fill values are 0. However, I'd like to change the fill values to np.nan. My first instinct was to call replace:
A.replace(0, np.nan)
But this gives
TypeError: cannot convert int to an sparseblock
Which doesn't really help me understand what I'm doing wrong.
I know I can do
A.to_dense().replace(0, np.nan).to_sparse()
But is there a better way? Or is my fundamental understanding of Sparse dataframes flawed?
tl;dr: That's definitely a bug.
But please keep reading, there is more than that...
All the following works fine with pandas 0.20.3, but not with any newer version:
A.replace(0,np.nan)
A.replace({0:np.nan})
A.replace([0],[np.nan])
etc... (you get the idea).
(from now on, all the code is done with pandas 0.20.3).
However, those (along with most of the workarounds I tried) only work because we accidentally did something wrong. You'll guess it right away if we do this:
A.density
1.0
This SparseDataFrame is actually dense!
We can fix this by passing default_fill_value=0:
A = pd.SparseDataFrame(
[['a',0,0,'b'],
[0,0,0,'c'],
[0,0,0,0],
[0,0,0,'a']],default_fill_value=0)
Now A.density will output 0.25, as expected: only 4 of the 16 entries ('a', 'b', 'c', 'a') differ from the fill value 0, and 4/16 = 0.25.
This happened because the initializer couldn't infer the dtypes of the columns. Quoting from pandas docs:
Sparse data should have the same dtype as its dense representation. Currently, float64, int64 and bool dtypes are supported. Depending on the original dtype, fill_value default changes:
- float64: np.nan
- int64: 0
- bool: False
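In newer pandas versions these per-dtype defaults are exposed directly on pd.SparseDtype, so the table above can be checked like this (a quick sketch; pd.SparseDtype did not yet exist in the 0.20.x/0.22 versions discussed here, it was added in 0.24):

```python
import numpy as np
import pandas as pd

# The default fill_value depends on the dtype, matching the table above.
# (Assumes pandas >= 0.24, where pd.SparseDtype is available.)
print(pd.SparseDtype(np.float64).fill_value)  # nan
print(pd.SparseDtype(np.int64).fill_value)    # 0
print(pd.SparseDtype(np.bool_).fill_value)    # False
```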
But the dtypes of our SparseDataFrame are:
A.dtypes
0 object
1 object
2 object
3 object
dtype: object
And that's why SparseDataFrame couldn't decide which fill value to use, and thus used the default np.nan.
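In newer pandas (0.24+) the same default is visible on pd.SparseDtype: for object data the fill value is NaN, which matches the silent fallback described above (a quick check, not possible in the 0.20.x/0.22 versions discussed here):

```python
import pandas as pd

# For object dtype, the default sparse fill value is NaN, which is why the
# SparseDataFrame above silently fell back to np.nan.
# (Assumes pandas >= 0.24, where pd.SparseDtype exists.)
print(pd.SparseDtype(object).fill_value)
```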
OK, so now we have a SparseDataFrame. Let's try to replace some entries in it:
A.replace('a','z')
0 1 2 3
0 z 0 0 b
1 0 0 0 c
2 0 0 0 0
3 0 0 0 z
And strangely:
A.replace(0,np.nan)
0 1 2 3
0 a 0 0 b
1 0 0 0 c
2 0 0 0 0
3 0 0 0 a
And that, as you can see, is not correct!
From my own experiments with different versions of pandas, it seems that SparseDataFrame.replace() works only with non-fill values.
To change the fill value, you have the following options:
- Convert into a dense DataFrame, do the replacement, then convert back into a SparseDataFrame.
- Construct a new SparseDataFrame, like in Wen's answer, or by passing default_fill_value set to the new fill value.
While I was experimenting with the last option, something even stranger happened:
B = pd.SparseDataFrame(A,default_fill_value=np.nan)
B.density
0.25
B.default_fill_value
nan
So far, so good. But... :
B
0 1 2 3
0 a 0 0 b
1 0 0 0 c
2 0 0 0 0
3 0 0 0 a
That really shocked me at first. Is that even possible!?
Continuing on, I tried to see what is happening in the columns:
B[0]
0 a
1 0
2 0
3 0
Name: 0, dtype: object
BlockIndex
Block locations: array([0], dtype=int32)
Block lengths: array([1], dtype=int32)
The dtype of the column is object, but the dtype of the BlockIndex associated with it is int32, hence the strange behavior.
There is a lot more "strange" things going on, but I'll stop here.
From all the above, I can say that you should avoid using SparseDataFrame until a complete rewrite of it takes place :).
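As a closing note: that rewrite did happen. SparseDataFrame was deprecated in pandas 0.25 and removed in 1.0, replaced by sparse dtypes on ordinary DataFrames. The fill-value change from the question can then be sketched as the same round trip (a sketch assuming pandas >= 1.0; the numeric data here is illustrative, since the modern API is happiest with a single concrete dtype):

```python
import numpy as np
import pandas as pd

# Modern equivalent of the to_dense()/to_sparse() round trip, using sparse
# dtypes instead of the removed SparseDataFrame (assumes pandas >= 1.0).
df = pd.DataFrame([[1, 0, 0, 2],
                   [0, 0, 0, 3],
                   [0, 0, 0, 0],
                   [0, 0, 0, 1]]).astype(pd.SparseDtype("int64", fill_value=0))
print(df.sparse.density)  # 0.25: 4 of the 16 entries are non-fill

# Densify, swap the old fill value for the new one, then re-sparsify.
dense = df.sparse.to_dense()
out = dense.replace(0, np.nan).astype(pd.SparseDtype("float64", fill_value=np.nan))
print(out.sparse.density)  # still 0.25, but now with NaN as the fill value
```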