Search code examples
pandasdataframenumericcoerce

how to regularize numeric/non-numeric entries in pandas dataframes


I want to control for non-numeric entries in my pandas dataframe. Say I have the following:

>>> df
   col_1  col_2  col_3
0   0.01    NaN    0.1
1    NaN    0.9    0.2
2   0.01    NaN    0.3
3   0.01    0.9    0.4

I can take the row-means as follows, while python properly skips over the NaN values:

>>> df.mean(axis=1)
0    0.055000
1    0.550000
2    0.155000
3    0.436667
dtype: float64

Great!. but now suppose one of the values from my imported table is a string

>>> df.iloc[0,1]="str1"
>>> df
   col_1 col_2  col_3
0   0.01  str1    0.1
1    NaN   0.9    0.2
2   0.01   NaN    0.3
3   0.01   0.9    0.4
>>> df.mean(axis=1)
0    0.055
1    0.200
2    0.155
3    0.205
dtype: float64

DANGER: the output looks plausible, but is wrong, because once I changed the value in position [0,1] to a string, the values in position [1,1] and [3,1] changed from being the number 0.9 to become the string "0.9", and all the strings are omitted from averaging (I guess each column has to be of the same type? there's probably a reason, but boy this is dangerously subtle.)

What I want to do now is force all the entries of the dataframe back into numeric type. Anything that can be sensibly coerced into a number should become that number, and anything that cannot should become nan (regardless of what string or type it might have been).

Pandas series have a function pandas.to_numeric where you can set errors='coerce', but unfortunately the analogous function for df's (DataFrame.astype()) doesn't allow this option.

Is there a function for "make every element of the dataFrame that looks like a number numeric, and make everything else nan"?


Solution

  • I think you can use to_numeric on a subset of columns with apply. This answer might help.