
python, pandas, work through bad data


So I've got a very large DataFrame of mostly floats (read from a CSV), but every now and then I get a string, or NaN:

                         date load
0  2016-07-12 19:04:31.604999    0
...
10 2016-07-12 19:04:31.634999    nan
...
50 2016-07-12 19:04:31.664999    ".942.197"
...

I can deal with the NaNs (interpolate), but I can't figure out how to use replace to catch strings and not numbers:

df.replace(to_replace='^[a-zA-Z0-9_.-]*$', regex=True, value=float('nan'))

returns all NaNs. I want NaN only when the value is actually a string.


Solution

  • I think you want pandas.to_numeric. It works with series-like data.

    In [1]: import pandas as pd
    
    In [2]: df = pd.DataFrame([0, float('nan'), '.942.197'], columns=['load'])
    
    In [3]: df
    Out[3]: 
           load
    0         0
    1       NaN
    2  .942.197
    
    In [4]: pd.to_numeric(df['load'], errors='coerce')
    Out[4]: 
    0    0.0
    1    NaN
    2    NaN
    Name: load, dtype: float64
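    As an aside, the reason the replace attempt in the question turned everything into NaN is visible in the pattern itself: the character class `[a-zA-Z0-9_.-]` allows digits and the dot, so numeric strings match it just as well as garbage like `'.942.197'`. A quick standalone check:

```python
import re

# The question's pattern: its character class includes digits, '_', '.' and '-',
# so purely numeric strings match it too.
pattern = r'^[a-zA-Z0-9_.-]*$'

print(bool(re.match(pattern, '0')))         # True: a numeric string matches
print(bool(re.match(pattern, '123.456')))   # True: so does this one
print(bool(re.match(pattern, '.942.197')))  # True: the bad value also matches
```

    Since every value read from the CSV is a string at that point, every row matched and was replaced.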
    

    Note, though, that to_numeric will try to convert every item to a number, so if you have a string that looks like a number it will be converted too:

    In [5]: df = pd.DataFrame([0, float('nan'), '123.456'], columns=['load'])
    
    In [6]: df
    Out[6]: 
          load
    0        0
    1      NaN
    2  123.456
    
    In [7]: pd.to_numeric(df['load'], errors='coerce')
    Out[7]: 
    0      0.000
    1        NaN
    2    123.456
    Name: load, dtype: float64
    

    I am not aware of any way to convert every non-numeric type to NaN other than iterating (or maybe using apply or map) and checking with isinstance.
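
    That iterate-and-check approach could be sketched like this (a minimal illustration, not part of the original answer): `map` with an `isinstance` test keeps values that are already real numbers and replaces everything else, including numeric-looking strings, with NaN:

```python
import pandas as pd

df = pd.DataFrame([0, float('nan'), '.942.197', '123.456'], columns=['load'])

# Keep values that are already int or float; replace everything else
# (including numeric-looking strings such as '123.456') with NaN.
# The final astype(float) gives a clean float64 column.
cleaned = df['load'].map(
    lambda x: x if isinstance(x, (int, float)) else float('nan')
).astype(float)
```

    The resulting column is numeric, so it can then be interpolated as planned in the question.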