I have a dataframe with two columns whose values I compare with each other. The rows in which the two values differ are saved in a new dataframe.
Dataframe before comparing:
other columns | value_a | value_b | other columns |
---|---|---|---|
... | 12 | 12 | ... |
... | 1.3 | 1.6 | ... |
... | abc | def | ... |
Dataframe after comparing:
other columns | value_a | value_b | other columns |
---|---|---|---|
... | 1.3 | 1.6 | ... |
... | abc | def | ... |
The problem is that I also get the following lines:
other columns | value_a | value_b | other columns |
---|---|---|---|
... |  |  | ... |
... |  |  | ... |
Empty cells are compared with each other and reported as non-matching.
To see which values occur in the columns value_a and value_b, I first built two helper columns with the following code:
df2['non-numeric_a'] = df['value_a'].mask(df['value_a'].notna())
df2['non-numeric_b'] = df['value_b'].mask(df['value_b'].notna())
Then I looked at the columns as a set, because I wanted to see the unique values that occur for each column:
print(set( df2['non-numeric_a']))
print(set( df2['non-numeric_b']))
My output for the sets was:
{nan}
and
{nan, nan, nan, ..., nan}
A NaN is not equal to itself, so a set treats every separate NaN object as a distinct element. Thus set([float('nan'), float('nan')])
-> {nan, nan}
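As a minimal sketch of why this happens, here is the same behaviour in plain Python without pandas. The variable names are only illustrative. A set first checks object identity and then equality, so two separate NaN objects are both kept, while the very same object added twice is stored once.

```python
a = float("nan")
b = float("nan")  # a second, separate NaN object

distinct = {a, b}  # two different objects, never equal to each other
same = {a, a}      # the same object twice, matched by identity

print(a == b)         # False
print(a == a)         # False
print(len(distinct))  # 2
print(len(same))      # 1
```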
Instead, use dropna before converting to a set:
set(df2['non-numeric_b'].dropna())
Or use unique, which keeps only a single NaN:
set(df2['non-numeric_b'].unique())
Example:
import pandas as pd
s = pd.Series([float('nan'), float('nan'), 1, 1, 2, 3])
set(s)
# {nan, nan, 1.0, 2.0, 3.0}
set(s.dropna())
# {1.0, 2.0, 3.0}
set(s.unique())
# {nan, 1.0, 2.0, 3.0}
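The same NaN behaviour probably explains the extra rows in the comparison itself. Below is a hedged sketch, assuming the empty cells are NaN and that the comparison is a plain element-wise inequality (the original comparison code is not shown). The small dataframe and the names both_missing, differs and mismatches are made up for illustration.

```python
import numpy as np
import pandas as pd

# Toy data modelled on the tables in the question (assumed, not the real data)
df = pd.DataFrame({
    "value_a": [12, 1.3, "abc", np.nan],
    "value_b": [12, 1.6, "def", np.nan],
})

# Rows where both cells are missing should not count as a mismatch
both_missing = df["value_a"].isna() & df["value_b"].isna()

# NaN != NaN is True, so exclude the rows where both sides are missing
differs = df["value_a"].ne(df["value_b"]) & ~both_missing

mismatches = df[differs]
print(mismatches)
```

With this mask, a row where only one of the two cells is missing is still reported as a mismatch, which may or may not be what you want.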