I am about to create a regression model, but I wonder what the best way is to handle NaN values in both numerical and categorical data.
I know that for numerical columns the following options could be useful:
1- Replace them with 0: df.fillna(0, inplace=True)
2- Replace them with the mean: df.fillna(df.mean(), inplace=True)
3- Replace them with the median: df.fillna(df.median(), inplace=True)
4- Delete each row that has a NaN value in my target column
Is it possible to get overfitting after adopting option 2 or 3? What is the best way to handle both categorical and numeric values in columns?
Also, I wonder what the best choice is for categorical data: should the missing values be handled after one-hot encoding?
Any help would be appreciated!
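As a common practice, df.fillna(df.mean(numeric_only=True), inplace=True) is preferred
for columns with continuous values, and df.fillna(df.mode().iloc[0], inplace=True)
for categorical values. Note that df.mode() returns a DataFrame, so its first row is taken with .iloc[0] rather than [0].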
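
A minimal sketch of that approach, assuming a small made-up DataFrame (the column names "age", "city" and "price" are purely illustrative and not from the question). It fills numeric columns with their mean and non-numeric columns with their most frequent value, selecting each group by dtype so that neither fill touches the other kind of column:

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; column names are illustrative only.
df = pd.DataFrame({
    "age": [25.0, np.nan, 35.0, 40.0],
    "city": ["Paris", "Lyon", None, "Paris"],
    "price": [100.0, 150.0, np.nan, 250.0],
})

# Split columns by dtype so each group gets a suitable fill value.
num_cols = df.select_dtypes(include="number").columns
cat_cols = df.select_dtypes(exclude="number").columns

# Continuous columns: fill with the column mean.
df[num_cols] = df[num_cols].fillna(df[num_cols].mean())

# Categorical columns: fill with the most frequent value.
# mode() returns a DataFrame, so .iloc[0] takes its first row.
df[cat_cols] = df[cat_cols].fillna(df[cat_cols].mode().iloc[0])
```

If the data is later split into training and test sets, computing the mean and mode on the training part only and reusing them for the test part avoids leaking test information into the fill values.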