
Recording/capturing exceptions thrown when using DataFrame.apply()


I have a DataFrame in which several columns need converting/parsing via a custom function. I can do this very easily with DataFrame.apply() and the following function:

def process_row(row) -> dict:
   try:
       newVal = my_conversion(row['oldCol'])
       out_row= {'newCol': newVal}
   except Exception as inst:
       # Sometimes my_conversion() throws exceptions, return NaN
       out_row = {'newCol': float('nan')}
       # Print error, but can we capture it???
       print(repr(inst))
   return out_row

This returns a nice new dictionary with the new column name and value for the processed row. I can then apply the conversion function to my DataFrame as such:

a = df.apply(lambda x: process_row(x), axis=1, result_type="expand")
df1 = df.merge(a, left_index=True, right_index=True)

The result works perfectly. I get my original data frame with a new column called "newCol" which contains the converted value for each row.

Occasionally the conversion throws an exception if the data is invalid. The error gets printed to the screen, and the converted value is recorded as NaN. From the DataFrame's perspective, mission accomplished.

HOWEVER, I would like to capture these errors in a separate data structure rather than just print them. How can I capture some of the error information and get it out of the applied function so I can store it in its own DataFrame (or list)? (All of the multiple-return-value questions I have seen add the returned values to the original DataFrame. I don't want to add the errors to the original DataFrame.)


Solution

  • Pass a list or dictionary to your function to save the exceptions:

    def process_row(row, ex=None) -> dict:
        try:
            newVal = int(row['oldCol'])
            out_row= {'newCol': newVal}
        except Exception as inst:
            # int() stands in for my_conversion() here; on failure, return NaN
            out_row = {'newCol': float('nan')}
            # save exception to the list
            if isinstance(ex, list):
                ex.append(inst)
            elif isinstance(ex, dict):
                ex[row.name] = inst
        return out_row
    
    import pandas as pd

    df = pd.DataFrame({'oldCol': [1, 2, 'a', None]})
    
    lst = []
    out = df.apply(process_row, axis=1, result_type="expand", ex=lst)
    print(out)
    print(lst)
    

    Output:

    # out
       newCol
    0     1.0
    1     2.0
    2     NaN
    3     NaN
    
    # lst
    [ValueError("invalid literal for int() with base 10: 'a'"),
     TypeError("int() argument must be a string, a bytes-like object or a real number, not 'NoneType'")]
    

    With a dictionary:

    dic = {}
    out = df.apply(process_row, axis=1, result_type="expand", ex=dic)
    print(dic)
    
    {2: ValueError("invalid literal for int() with base 10: 'a'"),
     3: TypeError("int() argument must be a string, a bytes-like object or a real number, not 'NoneType'")}
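
Since the question asks for the errors in their own DataFrame, here is a minimal sketch of one way to turn the captured dictionary into one. The column names `error_type` and `message` are illustrative choices, not anything pandas requires, and `int()` again stands in for `my_conversion()`. Because the dictionary is keyed by `row.name`, the error table shares its index with the original DataFrame, so the offending input values can be joined back in without touching `df` itself.

```python
import pandas as pd

def process_row(row, ex=None) -> dict:
    try:
        # int() stands in for the real my_conversion()
        return {'newCol': int(row['oldCol'])}
    except Exception as inst:
        # Record the exception under the row's index label
        if isinstance(ex, dict):
            ex[row.name] = inst
        return {'newCol': float('nan')}

df = pd.DataFrame({'oldCol': [1, 2, 'a', None]})

dic = {}
out = df.apply(process_row, axis=1, result_type="expand", ex=dic)

# Build a separate error table, indexed by the labels of the failing rows.
# The column names here are hypothetical; pick whatever suits your report.
errors = pd.DataFrame(
    [{'error_type': type(e).__name__, 'message': str(e)} for e in dic.values()],
    index=list(dic.keys()),
)

# Optionally attach the offending input values for easier debugging
errors = errors.join(df[['oldCol']])
print(errors)
```

If only the failing rows of the original data are needed, `df.loc[list(dic.keys())]` should select them directly.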