I'm trying to construct an ML model and don't want to skew it. I will be employing SMOTE, but some of the values in the house_type column have counts that are just too low to merit keeping them in the dataset when constructing this model. How do I drop the rows that have these values?
I was trying this
balears_df = balears_df.drop(balears_df[(balears_df.value_counts('house_type') < 500)])
where balears_df is the dataset and house_type is the column concerned
This obviously has not worked because my coding skills are severely lacking.
I'm trying to drop any row from balears_df whose house_type value has an overall count of less than 500, i.e. if there are fewer than 500 entries with a certain value I would like all of those entries dropped.
Any suggestions?
Group by the column, then filter to keep only the groups that have at least 500 entries. filter returns a new DataFrame, so assign the result back:
balears_df = balears_df.groupby('house_type').filter(lambda x: len(x) >= 500)
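To check the behaviour before running it on the real data, here is a minimal sketch on a toy frame. The data, the price column, and the threshold of 3 are made up for illustration (the question uses 500). It also shows two equivalent ways to build the same result with a boolean mask, which may be faster than a Python lambda on a large frame.

```python
import pandas as pd

# Toy stand-in for balears_df; values and the "price" column are hypothetical.
balears_df = pd.DataFrame({
    "house_type": ["flat"] * 4 + ["villa"] * 3 + ["cave"] * 1,
    "price": range(8),
})
min_count = 3  # illustrative threshold; the question uses 500

# Option 1: group by the column and keep only groups that are large enough.
kept_filter = balears_df.groupby("house_type").filter(lambda g: len(g) >= min_count)

# Option 2: compute each row's group size, then index with a boolean mask.
group_sizes = balears_df.groupby("house_type")["house_type"].transform("size")
kept_mask = balears_df[group_sizes >= min_count]

# Option 3: count the values once, then keep rows whose value is frequent enough.
counts = balears_df["house_type"].value_counts()
frequent_types = counts[counts >= min_count].index
kept_isin = balears_df[balears_df["house_type"].isin(frequent_types)]

print(kept_filter["house_type"].value_counts())
```

All three keep the rows of "flat" and "villa" and drop the single "cave" row. Use > instead of >= if a group with exactly the threshold count should also be dropped.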