python pandas regression nan categorical-data

What is the best way to handle NaN values in both numerical and categorical data?


I am about to build a regression model, and I wonder what the best way is to handle NaN values in both numerical and categorical data.

I know that for numerical columns the following options could be useful:

1- Replace NaN with 0: df.fillna(0, inplace=True)

2- Replace NaN with the column mean: df.fillna(df.mean(numeric_only=True), inplace=True)

3- Replace NaN with the column median: df.fillna(df.median(numeric_only=True), inplace=True)

4- Delete each row that has a NaN value in my target column
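To make the four options concrete, here is a minimal sketch on a made-up frame. The column names "x" (a feature) and "y" (the target) and all values are assumptions for illustration only, not data from the question.

```python
import numpy as np
import pandas as pd

# Hypothetical data: "x" is a numeric feature, "y" is the regression target.
df = pd.DataFrame({
    "x": [1.0, np.nan, 3.0, 8.0, 2.0],
    "y": [2.0, 4.0, np.nan, 16.0, 5.0],
})

# Option 4: drop rows whose target value is missing.
df = df.dropna(subset=["y"])

# Options 1-3 applied to the feature, kept as separate Series for comparison.
filled_zero = df["x"].fillna(0)
filled_mean = df["x"].fillna(df["x"].mean())
filled_median = df["x"].fillna(df["x"].median())
```

Keeping the results as separate Series, rather than using inplace=True, makes it easy to compare how each option changes the feature before committing to one.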

Is it possible to get overfitting after adopting option 2 or 3? What is the best way to handle both categorical and numeric values in columns?

Also, I wonder what the best choice for categorical data would be: should the NaN values be handled after applying one-hot encoding?

Any help would be appreciated!


Solution

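  • In usual practice, it is preferred to fill columns holding continuous values with the column mean, e.g. df.fillna(df.mean(numeric_only=True), inplace=True), and to fill categorical columns with the most frequent value (the mode), e.g. df.fillna(df.mode().iloc[0], inplace=True). Note that df.mode() returns a DataFrame, so the first row of modes is selected with .iloc[0] rather than [0].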
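The mean/mode approach above could be sketched as follows for a frame with mixed column types. The data and the column names ("age", "city", "price") are hypothetical, and splitting columns by dtype with select_dtypes is one possible way to separate them, not necessarily what the answer had in mind.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; column names and values are illustrative only.
df = pd.DataFrame({
    "age": [25.0, np.nan, 35.0, 30.0],
    "city": ["Paris", "Rome", None, "Paris"],
    "price": [10.0, 20.0, np.nan, 30.0],
})

# Split columns by dtype: numeric ones versus everything else.
num_cols = df.select_dtypes(include="number").columns
cat_cols = df.columns.difference(num_cols)

# Numeric columns: fill NaN with each column's mean.
df[num_cols] = df[num_cols].fillna(df[num_cols].mean())

# Categorical columns: fill NaN with each column's most frequent value.
# df.mode() returns a DataFrame, so .iloc[0] picks the first mode per column.
df[cat_cols] = df[cat_cols].fillna(df[cat_cols].mode().iloc[0])
```

Filling each group of columns separately avoids calling mean() on text columns and keeps the mode from overwriting numeric gaps.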