Mapping strip() to strings in pandas dataframe does not change NaN entries but still claims that they are different?


I have a dataframe with many different kinds of entries (text, integers, floats, times, etc.), and I am trying to strip leading and trailing whitespace from the text entries so that my other code works as expected. However, my code does not seem to work.

Here is a simple example of what I'm trying to do:

import pandas as pd
import numpy as np

df1 = pd.DataFrame(np.array(([np.nan, 2, 3], [4, 5, 6])), columns=["one", "two", "three"])
print(df1)
print("")
df2 = df1.map(lambda x: x.strip() if isinstance(x, str) else x)
print(df2)
print("")
print(df1==df2)
print("")
cell1 = df1.at[0, "one"]
cell2 = df2.at[0, "one"]
print(cell1, type(cell1))
print(cell2, type(cell2))
print(cell1==cell2)

When I run this code, the output is:

   one  two  three
0  NaN  2.0    3.0
1  4.0  5.0    6.0

   one  two  three
0  NaN  2.0    3.0
1  4.0  5.0    6.0

     one   two  three
0  False  True   True
1   True  True   True

nan <class 'numpy.float64'>
nan <class 'numpy.float64'>
False

As you can see, df1 and df2 have exactly the same entries (including the NaN), but print(cell1==cell2) claims that these cells are different.

What is going on here?


Solution

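  • That's how floats work: NaN never compares equal to anything, including itself, so you can't compare NaNs directly with == (see: Why is NaN not equal to NaN?)

    Use DataFrame.equals to compare the dataframes; it treats NaNs in the same location as equal: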

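    import numpy as np
    import pandas as pd

    df1 = pd.DataFrame(
        np.array(([np.nan, 2, 3], [4, 5, 6])), columns=["one", "two", "three"]
    )
    
    # DataFrame.map needs pandas >= 2.1; older versions use applymap instead
    df2 = df1.map(lambda x: x.strip() if isinstance(x, str) else x)
    
    # equals() treats NaNs in the same position as equal
    print(df1.equals(df2))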
    

    Prints:

    True
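
If you still want a cell-by-cell comparison like df1==df2, one possible approach is to combine the == result with a check that both cells are NaN. The snippet below is only a sketch: df2 is a plain copy standing in for the stripped dataframe, and the names nan and same are made up for the illustration.

```python
import numpy as np
import pandas as pd

# NaN is not equal to itself, so == cannot detect "both are NaN"
nan = float("nan")
print(nan == nan)      # False
print(np.isnan(nan))   # True

df1 = pd.DataFrame(
    np.array(([np.nan, 2, 3], [4, 5, 6])), columns=["one", "two", "three"]
)
# Stand-in for the stripped dataframe (assumption: stripping changed nothing)
df2 = df1.copy()

# Cells match if they compare equal, or if both are NaN
same = (df1 == df2) | (df1.isna() & df2.isna())
print(same)

# Collapse to a single bool over all rows and columns
print(bool(same.all().all()))
```

Unlike DataFrame.equals, this keeps the per-cell mask, so you can see which cells differ. Note that it does not compare dtypes, which DataFrame.equals does.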