python dataset data-analysis

How to drop rows from a dataset where the value counts are low


I'm trying to construct an ML model and don't want to skew it. I will be employing SMOTE, but some values in the house_type column have counts that are just too low to merit keeping them in the dataset when constructing this model. How do I drop the rows that have these values?

I was trying this

balears_df = balears_df.drop(balears_df[(balears_df.value_counts('house_type') < 500)])

where balears_df is the dataset and house_type is the column concerned

This obviously has not worked cause my coding skills are severely lacking

I'm trying to drop any row from balears_df whose house_type value has an overall count of less than 500, i.e. if there are fewer than 500 entries with a certain value I would like all of those entries dropped.

Any suggestions?


Solution

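  • group by the column, then filter to keep only the groups that have at least 500 entries (the question drops counts below 500, so the comparison is >=), and assign the result back

    balears_df = balears_df.groupby('house_type').filter(lambda x: len(x) >= 500)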
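
An alternative that stays closer to the original attempt is sketched below. The attempt in the question fails because value_counts returns one count per distinct house_type, not one per row, so it cannot be used directly to select rows. Mapping the counts back onto the column gives a per-row count that can be used as a mask. The helper name drop_rare_values, the toy data, and the threshold of 3 are made up for illustration; only the house_type column name comes from the question.

```python
import pandas as pd


def drop_rare_values(df, column, min_count):
    """Keep only rows whose value in `column` occurs at least `min_count` times."""
    # value_counts() gives one count per distinct value; map() spreads those
    # counts back onto every row so they line up with the DataFrame index.
    per_row_counts = df[column].map(df[column].value_counts())
    # Rows with a missing value in `column` get NaN here and are dropped too.
    return df[per_row_counts >= min_count]


# Toy stand-in for balears_df (hypothetical data, small threshold).
toy_df = pd.DataFrame({
    "house_type": ["flat"] * 5 + ["villa"] * 3 + ["cave"],
    "price": range(9),
})

filtered = drop_rare_values(toy_df, "house_type", min_count=3)
print(filtered["house_type"].value_counts())
```

On the real data this would presumably be called as balears_df = drop_rare_values(balears_df, 'house_type', 500). For large frames the mapping approach tends to be faster than groupby with a Python lambda, though both should select the same rows.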