python, jupyter-notebook, data-science, data-preprocessing

Recursion error: Dataprep function not working post cleaning data


I am using the dataprep API and get a recursion error when I call the dataprep functions in Google Colab. Oddly, it works fine on the uncleaned data with 144 features, but once I reduce the data to 20 features and fill in the missing values, I get a recursion error.

Code:

df.isna().sum()
Output:
grade                     0
sub_grade                 0
emp_length                0
home_ownership            0
annual_inc                0
verification_status       0
loan_status               0
purpose                   0
dti                       0
delinq_2yrs               0
inq_last_6mths            0
mths_since_last_delinq    0
open_acc                  0
pub_rec                   0
revol_bal                 0
revol_util                0
total_acc                 0
recoveries                0
pub_rec_bankruptcies      0
tax_liens                 0
dtype: int64

import sys

sys.setrecursionlimit(15000)

from dataprep.eda import create_report, plot, plot_correlation

create_report(df)

Error:

---------------------------------------------------------------------------
RecursionError                            Traceback (most recent call last)
<ipython-input-55-463fb2fdfb17> in <module>
----> 1 create_report(df)

33 frames
... last 10 frames repeated, from the frame below ...

/usr/local/lib/python3.8/dist-packages/pandas/core/series.py in __repr__(self)
   1463         show_dimensions = get_option("display.show_dimensions")
   1464 
-> 1465         self.to_string(
   1466             buf=buf,
   1467             name=self.name,

RecursionError: maximum recursion depth exceeded

Following the advice of the first answer, I was able to go through one series at a time and it looks like this code is causing the issue. How can this be written better?

# these columns will take the median value for fillna
median_fill = ['emp_length','annual_inc','open_acc','pub_rec','open_acc','revol_util','total_acc']
for med in median_fill:
  df[med].fillna(df[med].median,inplace=True)

Solution

  • You omitted important details from the stack trace.

    But if I had to guess, here's what's happening.

    Something in create_report wound up calling repr(foo), where foo is a complex custom object.

    In the course of computing self.to_string( ... ) we wound up accidentally calling either to_string or repr(foo) again. That is infinite recursion, essentially a while True: loop, so raising the limit with sys.setrecursionlimit() won't help.
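
To make that guess concrete, here is a minimal pure-Python sketch of how a repr can end up calling itself. The Column class and its median method are hypothetical stand-ins, not pandas code. The only fact relied on is that CPython's repr of a bound method includes the repr of the object it is bound to.

```python
class Column:
    """Hypothetical stand-in for an object whose repr prints its contents."""

    def __init__(self, values):
        self.values = list(values)

    def median(self):
        # Middle element of the sorted values (good enough for a sketch).
        ordered = sorted(self.values)
        return ordered[len(ordered) // 2]

    def __repr__(self):
        # Print every stored value via repr(), one cell at a time.
        return "Column(" + ", ".join(repr(v) for v in self.values) + ")"


col = Column([3, 1, 2])
print(repr(col))  # Column(3, 1, 2)

# Store the bound method itself (note: no parentheses) inside the object.
col.values.append(col.median)

# repr(col) -> repr(col.median) -> repr(col) -> ... and so on.
try:
    repr(col)
    outcome = "ok"
except RecursionError:
    outcome = "RecursionError"

print(outcome)  # RecursionError
```

Raising the recursion limit only lets the cycle run deeper before it fails; the cycle itself has to be broken.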


    You want to understand what foo is all about, in order to properly diagnose the root cause and then fix this.

    Start with a simpler report, and build up to the point where you trigger the error.
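
One way to narrow it down, sketched under the assumption that the failure shows up when a single column is printed: try repr() on each piece separately and note which ones raise. find_unprintable and Loops are illustrative names, not part of pandas or dataprep.

```python
def find_unprintable(named_objects):
    """Return the names whose repr() raises RecursionError.

    named_objects is any iterable of (name, obj) pairs.
    """
    bad = []
    for name, obj in named_objects:
        try:
            repr(obj)
        except RecursionError:
            bad.append(name)
    return bad


class Loops:
    """Toy object whose repr never terminates, used only to exercise the helper."""

    def __repr__(self):
        return repr(self)


suspects = find_unprintable([("fine", [1, 2, 3]), ("broken", Loops())])
print(suspects)  # ['broken']
```

With a DataFrame you could pass df.items(), which yields (column name, Series) pairs, so each column is checked on its own.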


    EDIT

    You wrote

      df[med].fillna(df[med].median, inplace=True)
    

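    Don't do that. df[med].median without parentheses is the bound method itself, not the median value, so each missing entry gets filled with a method object. Printing such a Series probably recurses, because the repr of a bound method includes the repr of the Series it belongs to. Call the method, and rather than inplace, prefer assignment:

      df[med] = df[med].fillna(df[med].median())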
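
As a quick sanity check of filling with the computed median, here is a small sketch on a toy frame. The values are made up for illustration; only the column names come from your list, and the loop calls median() with parentheses so that a number, not a method, is passed to fillna.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the loan data; the numbers are invented.
df = pd.DataFrame({
    "annual_inc": [40000.0, np.nan, 60000.0, 80000.0],
    "open_acc": [5.0, 7.0, np.nan, 9.0],
})

# These columns take their own median as the fill value.
median_fill = ["annual_inc", "open_acc"]

for med in median_fill:
    # median() is called here, so fillna receives a plain number.
    df[med] = df[med].fillna(df[med].median())

print(df.isna().sum())
```

After the loop both columns should still be numeric, which is easy to confirm with df.dtypes before handing the frame to create_report.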