So I've got a very large DataFrame of mostly floats (read from a CSV), but every now and then I get a string, or NaN:
          date                        load
0         2016-07-12 19:04:31.604999  0
...
10        2016-07-12 19:04:31.634999  nan
...
50        2016-07-12 19:04:31.664999  ".942.197"
...
I can deal with the NaNs (interpolate), but I can't figure out how to use replace to catch only strings, not numbers:
df.replace(to_replace='^[a-zA-Z0-9_.-]*$', regex=True, value=float('nan'))
returns all NaNs. I want NaN only when the value is actually a string.
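A quick check of the pattern (a sketch, using Python's re module on the sample values from the question) shows why everything is replaced: the character class also matches every numeric string, so after read_csv turns the mixed column into strings, the numbers match too.

```python
import re

# The pattern from the replace() call above: any run of letters,
# digits, underscores, dots and dashes -- numeric strings included.
pattern = r'^[a-zA-Z0-9_.-]*$'

print(bool(re.match(pattern, '.942.197')))  # True -- the bad string matches
print(bool(re.match(pattern, '0')))         # True -- but so does a plain number
```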
I think you want pandas.to_numeric. It works with series-like data.
In [1]: import pandas as pd
In [2]: df = pd.DataFrame([0, float('nan'), '.942.197'], columns=['load'])
In [3]: df
Out[3]:
       load
0         0
1       NaN
2  .942.197
In [4]: pd.to_numeric(df['load'], errors='coerce')
Out[4]:
0    0.0
1    NaN
2    NaN
Name: load, dtype: float64
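In practice you would assign the coerced column back and then interpolate, as mentioned in the question. A minimal sketch (the data here is hypothetical, with the bad string bracketed by real values so the interpolation has endpoints):

```python
import pandas as pd

df = pd.DataFrame([0, '.942.197', 2], columns=['load'])

# Coerce the bad string to NaN, then fill the gap by interpolation.
df['load'] = pd.to_numeric(df['load'], errors='coerce')
df['load'] = df['load'].interpolate()

print(df['load'].tolist())  # [0.0, 1.0, 2.0]
```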
Note that to_numeric will try to convert every item to a number, so a string that merely looks like a number will be converted too:
In [5]: df = pd.DataFrame([0, float('nan'), '123.456'], columns=['load'])
In [6]: df
Out[6]:
      load
0        0
1      NaN
2  123.456
In [7]: pd.to_numeric(df['load'], errors='coerce')
Out[7]:
0      0.000
1        NaN
2    123.456
Name: load, dtype: float64
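If several columns are affected, the same coercion can be applied frame-wide, since DataFrame.apply forwards extra keyword arguments to the function. A sketch with hypothetical column names:

```python
import pandas as pd

df = pd.DataFrame({'a': [0, 'x', 2], 'b': ['1.5', None, '7']})

# apply() runs pd.to_numeric on each column; errors='coerce'
# is forwarded, so non-numeric cells become NaN.
clean = df.apply(pd.to_numeric, errors='coerce')

print(clean['a'].tolist())  # 0.0, NaN, 2.0
```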
I am not aware of any way to convert every non-numeric type to NaN other than iterating (or perhaps using apply or map) and checking with isinstance.
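A sketch of that approach: map visits each element, and only genuine strings are replaced, so numeric-looking strings like '123.456' become NaN too while real floats survive.

```python
import pandas as pd

df = pd.DataFrame([0, float('nan'), '123.456', '.942.197'], columns=['load'])

# Replace any str element with NaN; non-string values pass through.
df['load'] = df['load'].map(
    lambda x: float('nan') if isinstance(x, str) else x
)

print(df['load'].tolist())  # 0.0 followed by three NaNs
```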