I am working on a rental price prediction project where I web scraped data from Facebook Marketplace. When extracting the areas of the properties, I am encountering many NaN values.
I am web scraping from a small city and it is unlikely that I will be able to find more data. How can I effectively handle the NaN values in my data? Are there any machine learning algorithms or external sources of information that can be used to impute missing values in this situation?
Any suggestions or advice would be greatly appreciated. Thank you in advance!
I have considered using the mean or median based on property type, number of bedrooms, and bathrooms, but I am not sure if this is the best approach.
There are many methods for handling missing values in your data. As you mentioned, a common approach is to fill them with the mean or median. I recommend grouping the rows first and then filling with the group's mean or median, since properties that are similar (same area, same type) tend to have similar values.
df['a'] = df['a'].fillna(df.groupby('b')['a'].transform('mean'))
I reckon you can use the zipcode or something similar to group them.
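A minimal runnable sketch of the grouped fill, using a hypothetical toy frame where `area` has gaps and `zipcode` is the grouping key (both column names are placeholders for whatever your scraped data uses):

```python
import numpy as np
import pandas as pd

# Toy data: two zipcodes, each with one missing area.
df = pd.DataFrame({
    "zipcode": ["A", "A", "A", "B", "B"],
    "area":    [50.0, 70.0, np.nan, 100.0, np.nan],
})

# Fill each missing area with the mean area of its zipcode group.
df["area"] = df["area"].fillna(df.groupby("zipcode")["area"].transform("mean"))

print(df["area"].tolist())  # [50.0, 70.0, 60.0, 100.0, 100.0]
```

Note that `transform('mean')` returns a Series aligned to the original index, which is what lets `fillna` match each NaN to its own group's mean.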
Another thing you can do is, before filling the empty places, create an extra column that indicates which values were missing. This may help your model treat those rows differently and avoid overfitting on the imputed values.
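For the indicator idea, the key point is to record the missingness flag *before* imputing, e.g. (column names again hypothetical):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"area": [55.0, np.nan, 80.0]})

# Flag rows where 'area' was missing BEFORE imputation.
df["area_was_missing"] = df["area"].isna().astype(int)

# Then impute (plain median here; use the grouped fill in practice).
df["area"] = df["area"].fillna(df["area"].median())

print(df)
```

The model then receives both the filled value and the flag, so it can learn a different weight for imputed rows.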
For further info, see: link