Tags: python, pandas, data-science, data-cleaning

pandas.to_numeric - find out which string it was unable to parse


Applying pandas.to_numeric to a DataFrame column that contains strings representing numbers (and possibly some unparsable strings) raises an error like this:

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-66-07383316d7b6> in <module>()
      1 for column in shouldBeNumericColumns:
----> 2     trainData[column] = pandas.to_numeric(trainData[column])

/usr/local/lib/python3.5/site-packages/pandas/tools/util.py in to_numeric(arg, errors)
    113         try:
    114             values = lib.maybe_convert_numeric(values, set(),
--> 115                                                coerce_numeric=coerce_numeric)
    116         except:
    117             if errors == 'raise':

pandas/src/inference.pyx in pandas.lib.maybe_convert_numeric (pandas/lib.c:53558)()

pandas/src/inference.pyx in pandas.lib.maybe_convert_numeric (pandas/lib.c:53344)()

ValueError: Unable to parse string

Wouldn't it be helpful to see which value failed to parse?


Solution

  • You can pass errors='coerce' to convert unparsable values to NaN, then locate those values with isnull and boolean indexing:

    print (df[pd.to_numeric(df.col, errors='coerce').isnull()])
    

    Sample:

    import pandas as pd

    df = pd.DataFrame({'B':['a','7','8'],
                       'C':[7,8,9]})
    
    print (df)
       B  C
    0  a  7
    1  7  8
    2  8  9
    
    print (df[pd.to_numeric(df.B, errors='coerce').isnull()])
       B  C
    0  a  7
    

    Or, to find all strings in a mixed column (numeric values alongside strings), check whether each value is a str:

    df = pd.DataFrame({'B':['a',7, 8],
                       'C':[7,8,9]})
    
    print (df)
       B  C
    0  a  7
    1  7  8
    2  8  9
    
    print (df[df.B.apply(lambda x: isinstance(x, str))])
       B  C
    0  a  7
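
One caveat with the errors='coerce' check above: rows that were already missing before conversion are selected as well, because they are also NaN afterwards. The following is a minimal sketch, not a definitive implementation, that reports only values which were present but could not be parsed, and applies the check to several columns as in the question's loop. The names find_unparsable and train_data, and the sample values, are hypothetical and chosen only for illustration.

```python
import pandas as pd


def find_unparsable(series):
    """Return the entries of `series` that pd.to_numeric cannot convert.

    Hypothetical helper: values that were already missing are not reported.
    """
    converted = pd.to_numeric(series, errors="coerce")
    # NaN after conversion, but not missing in the original data
    failed = converted.isnull() & series.notnull()
    return series[failed]


# Made-up sample data: column B holds two bad strings and one missing value
train_data = pd.DataFrame({
    "B": ["a", "7", None, "8.5", "12kg"],
    "C": ["1", "2", "3", "4", "5"],
})

for column in ["B", "C"]:
    bad = find_unparsable(train_data[column])
    if not bad.empty:
        # The index of `bad` gives the row labels of the offending values
        print(column, bad.tolist())
```

For this sample frame the loop reports 'a' and '12kg' for column B and nothing for column C. The missing entry in B is skipped because it was not a parse failure.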