Tags: python, pandas, sparse-matrix, sparse-dataframe

Changing the fill_values in a SparseDataFrame - replace throws TypeError


Current pandas version: 0.22


I have a SparseDataFrame.

A = pd.SparseDataFrame(
    [['a',0,0,'b'],
     [0,0,0,'c'],
     [0,0,0,0],
     [0,0,0,'a']])

A

   0  1  2  3
0  a  0  0  b
1  0  0  0  c
2  0  0  0  0
3  0  0  0  a

Right now, the fill values are 0. However, I'd like to change the fill_values to np.nan. My first instinct was to call replace:

A.replace(0, np.nan)

But this gives

TypeError: cannot convert int to an sparseblock

Which doesn't really help me understand what I'm doing wrong.

I know I can do

A.to_dense().replace(0, np.nan).to_sparse()

But is there a better way? Or is my fundamental understanding of sparse DataFrames flawed?


Solution

  • tl;dr : That's definitely a bug.
    But please keep reading, there is more than that...

    All of the following work fine with pandas 0.20.3, but not with any newer version:

    A.replace(0,np.nan)
    A.replace({0:np.nan})
    A.replace([0],[np.nan])
    

    etc... (you get the idea).

    (From now on, all code examples use pandas 0.20.3.)

    However, those (along with most of the workarounds I tried) only work because we accidentally did something wrong. You'll guess what right away if we do this:

    A.density
    
    1.0
    

    This SparseDataFrame is actually dense!
    We can fix this by passing default_fill_value=0:

    A = pd.SparseDataFrame(
        [['a',0,0,'b'],
         [0,0,0,'c'],
         [0,0,0,0],
         [0,0,0,'a']], default_fill_value=0)
    

    Now A.density will output 0.25 as expected.
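(Side note for readers on current pandas: SparseDataFrame was removed in pandas 1.0, so the construction above no longer runs there. A rough modern-pandas sketch of the same fix — an explicit fill value, verified via density — uses sparse dtypes instead:)

```python
import pandas as pd

# Modern pandas (>= 1.0) has no SparseDataFrame; the equivalent is a regular
# DataFrame whose columns are converted to a SparseDtype with an explicit
# fill_value.
df = pd.DataFrame(
    [['a', 0, 0, 'b'],
     [0, 0, 0, 'c'],
     [0, 0, 0, 0],
     [0, 0, 0, 'a']])

# Treat 0 as the fill value, so only the four non-zero entries are stored.
sparse_df = df.astype(pd.SparseDtype(object, fill_value=0))

print(sparse_df.sparse.density)  # 4 stored values out of 16 -> 0.25
```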

    This happened because the initializer couldn't infer the dtypes of the columns. Quoting from pandas docs:

    Sparse data should have the same dtype as its dense representation. Currently, float64, int64 and bool dtypes are supported. Depending on the original dtype, fill_value default changes:

    • float64: np.nan
    • int64: 0
    • bool: False

    But the dtypes of our SparseDataFrame are:

    A.dtypes
    
    0    object
    1    object
    2    object
    3    object
    dtype: object
    

    And that's why SparseDataFrame couldn't decide which fill value to use, and thus used the default np.nan.
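(The dtype-to-fill-value mapping from the docs quote survived the later sparse rewrite; in modern pandas you can see the same defaults directly on SparseDtype. A quick sketch:)

```python
import numpy as np
import pandas as pd

# The default fill value depends on the subtype, exactly as the quoted
# docs describe: float -> NaN, int -> 0, bool -> False.
print(pd.SparseDtype(np.float64).fill_value)  # nan
print(pd.SparseDtype(np.int64).fill_value)    # 0
print(pd.SparseDtype(np.bool_).fill_value)    # False
```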

    OK, so now we have a SparseDataFrame. Let's try to replace some entries in it:

    A.replace('a','z')
        0   1   2   3
    0   z   0   0   b
    1   0   0   0   c
    2   0   0   0   0
    3   0   0   0   z
    
    And strangely:
    A.replace(0,np.nan)
        0   1   2   3
    0   a   0   0   b
    1   0   0   0   c
    2   0   0   0   0
    3   0   0   0   a
    
    And that, as you can see, is not correct!
    From my own experiments with different versions of pandas, it seems that SparseDataFrame.replace() works only with non-fill values. To change the fill value, you have the following options:

    • According to pandas docs, if you change the dtypes, that will automatically change the fill value. (That didn't work for me.)
    • Convert into a dense DataFrame, do the replacement, then convert back into SparseDataFrame.
    • Manually reconstruct a new SparseDataFrame, like Wen's answer, or by passing default_fill_value set to the new fill value.
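(For the record, the dense round trip of the second option translates to modern pandas roughly like this — a sketch only, since to_dense()/to_sparse() on frames are gone and the conversion now goes through astype and the .sparse accessor:)

```python
import numpy as np
import pandas as pd

# Start from a sparse frame whose fill value is 0.
A = pd.DataFrame(
    [['a', 0, 0, 'b'],
     [0, 0, 0, 'c'],
     [0, 0, 0, 0],
     [0, 0, 0, 'a']]).astype(pd.SparseDtype(object, fill_value=0))

# Densify, replace the old fill value, then re-sparsify with the new one.
B = (A.sparse.to_dense()
      .replace(0, np.nan)
      .astype(pd.SparseDtype(object, fill_value=np.nan)))

print(B.sparse.density)  # still only the 4 non-fill entries are stored
```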

    While I was experimenting with the last option, something even stranger happened:

    B = pd.SparseDataFrame(A,default_fill_value=np.nan)
    
    B.density
    0.25
    
    B.default_fill_value
    nan
    

    So far, so good. But... :

    B
        0   1   2   3
    0   a   0   0   b
    1   0   0   0   c
    2   0   0   0   0
    3   0   0   0   a
    

    That really shocked me at first. Is that even possible!?
    Continuing on, I tried to see what was happening in the columns:

    B[0]
    
    0    a
    1    0
    2    0
    3    0
    Name: 0, dtype: object
    BlockIndex
    Block locations: array([0], dtype=int32)
    Block lengths: array([1], dtype=int32)
    

    The dtype of the column is object, but the dtype of the BlockIndex associated with it is int32, hence the strange behavior.
    There are a lot more "strange" things going on, but I'll stop here.
    From all the above, I can say that you should avoid using SparseDataFrame until a complete rewrite of it takes place :).
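(Postscript: that rewrite did eventually happen — SparseDataFrame was deprecated in pandas 0.25 and removed in 1.0, in favor of columns backed by SparseArray. The stored-values-plus-index bookkeeping discussed above is still visible on a modern SparseArray; a sketch:)

```python
import numpy as np
import pandas as pd

# A modern SparseArray keeps only the non-fill values, plus a sparse index
# recording where they sit in the logical array.
arr = pd.arrays.SparseArray(np.array(['a', 0, 0, 0], dtype=object),
                            fill_value=0)

print(arr.sp_values)  # only 'a' is physically stored
print(arr.sp_index)   # the positions of the stored values
```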