python pandas dataframe set nan

Are there different string nan values?


I have a dataframe with two columns whose values I compare row by row. The rows where the values differ, along with the differing values themselves, are saved in a new dataframe.

Dataframe before comparing:

other columns value_a value_b other columns
... 12 12 ...
... 1.3 1.6 ...
... abc def ...

Dataframe after comparing:

other columns value_a value_b other columns
... 1.3 1.6 ...
... abc def ...

The problem is that I also get the following lines:

other columns value_a value_b other columns
... ...
... ...

Empty cells are compared with each other and reported as non-matching.

Now I have created a set for each of the columns value_a and value_b to see which values occur in the columns. I used the following code for this:

df2['non-numeric_a'] = df['value_a'].mask(df['value_a'].notna())

df2['non-numeric_b'] = df['value_b'].mask(df['value_b'].notna())

Then I looked at the columns as a set, because I wanted to see the unique values that occur in each column:

print(set(df2['non-numeric_a']))

print(set(df2['non-numeric_b']))

My output for the sets was: {nan} and {nan, nan, nan, ..., nan}


Solution

  • A NaN is not equal to itself, so a set cannot collapse separately created NaN objects into one: set([float('nan'), float('nan')]) -> {nan, nan}

    Instead, call dropna before converting to a set:

    set(df2['non-numeric_b'].dropna())
    

    Or:

    set(df2['non-numeric_b'].unique())
    

    Example:

    import pandas as pd
    
    s = pd.Series([float('nan'), float('nan'), 1, 1, 2, 3])
    
    set(s)
    # {nan, nan, 1.0, 2.0, 3.0}
    
    set(s.dropna())
    # {1.0, 2.0, 3.0}
    
    set(s.unique())
    # {nan, 1.0, 2.0, 3.0}
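
The same NaN behaviour probably also explains why the empty cells are reported as non-matching in the original comparison. The sketch below is only an illustration under assumptions: the sample data is made up to mirror the value_a and value_b columns from the question, and the names distinct_nans, same_nan, naive_diff, both_missing and mismatches are hypothetical. It first shows that a set keeps two separately created NaN objects but collapses the same object, then filters out rows where both cells are missing before reporting differences.

```python
import pandas as pd

# Two separately created NaN objects: unequal and not identical, so a set keeps both.
a, b = float("nan"), float("nan")
distinct_nans = {a, b}

# The same NaN object twice: the set's identity check collapses it to one entry.
same_nan = {a, a}

# Hypothetical sample data mirroring the question's value_a / value_b columns.
df = pd.DataFrame({
    "value_a": [12, 1.3, "abc", float("nan")],
    "value_b": [12, 1.6, "def", float("nan")],
})

# A plain != flags the row where both cells are empty, because NaN != NaN.
naive_diff = df["value_a"] != df["value_b"]

# Exclude rows where both cells are missing so they count as a match.
both_missing = df["value_a"].isna() & df["value_b"].isna()
mismatches = df[naive_diff & ~both_missing]

print(len(distinct_nans), len(same_nan))
print(naive_diff.tolist())
print(mismatches.index.tolist())
```

With this sample data, the last row is flagged by naive_diff but no longer appears in mismatches. Whether two empty cells should count as a match depends on the data, so treat the both_missing filter as one possible choice rather than the required one.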