python pandas data-cleaning

Removing rows where there is a value match


def remove_low_data_states(column_name):
    items = df[column_name].value_counts().reset_index()
    items.columns = ['place', 'value']
    print(f'Items in column: [{column_name}] with low data')
    return list(items[items['value'].apply(lambda val: val < items.value.median())].place)

remove_low_data_states('col1') --> returns ['hello', 'bye']

Original table

col1 col2 col3
hello 2 4
world 2 4
bye 2 4

Updated table

col1 col2 col3
world 2 4

The function above returns a list of the values in a column whose counts fall below the median count. How can I then use that list to remove every row whose value in that column appears in the list?

I have tried using pd.drop, but that was not too helpful, or I am making some sort of mistake.


Solution

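  • We can use .isin() to build a boolean mask of the rows whose col1 value appears in the list, then negate the mask with ~ so that only the other rows are kept.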

    
    def remove_low_data_states(column_name):
        # Count how often each value occurs in the column
        items = df[column_name].value_counts().reset_index()
        items.columns = ['place', 'value']
        print(f'Items in column: [{column_name}] with low data')
        # Compare the counts to the median directly; no per-row lambda is needed
        return list(items[items['value'] < items['value'].median()].place)

    # Keep only the rows whose col1 value is NOT in the low-data list
    df = df[~df['col1'].isin(remove_low_data_states('col1'))]

    df.head()
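
For anyone who wants to try this end to end, here is a minimal self-contained sketch of the same idea. The sample data and the names `sample` and `low_count_values` are made up for illustration and are not from the question; the helper also takes the DataFrame as an argument instead of relying on a global `df`. The last step shows one way `drop` could be used as well: it is a DataFrame method that works on index labels, so it needs the index of the matching rows rather than the list of values.

```python
import pandas as pd


def low_count_values(frame, column_name):
    """Return the values in `column_name` whose count is below the median count."""
    counts = frame[column_name].value_counts()
    return counts[counts < counts.median()].index.tolist()


# Hypothetical sample data: 'world' appears 3 times, 'foo' twice,
# 'hello' and 'bye' once each, so the median count is 1.5.
sample = pd.DataFrame({
    'col1': ['hello', 'world', 'world', 'world', 'bye', 'foo', 'foo'],
    'col2': [2, 2, 2, 2, 2, 2, 2],
    'col3': [4, 4, 4, 4, 4, 4, 4],
})

low = low_count_values(sample, 'col1')

# Option 1: boolean mask with isin(), negated with ~
filtered = sample[~sample['col1'].isin(low)]

# Option 2: drop() with the index labels of the matching rows
dropped = sample.drop(sample[sample['col1'].isin(low)].index)

print(sorted(low))
print(filtered)
```

Both options keep the original index labels; adding `.reset_index(drop=True)` would renumber the rows if that is needed.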