Search code examples
pythonpandasdataframegroup-by

dataframe filter groupby based on a subset


df_example = pd.DataFrame({'name': ['a', 'a', 'a', 'b', 'b', 'b'],
                          'class': [1, 2, 2, 3, 2, 2],
                          'price': [3, 4, 2, 1, 6, 5]})

I want to filter each name where the price is larger than the smallest price in a subset class==2 within name grounp:

df_example.sort_values(['name', 'price'], inplace=True)
df_tem = df_example[df_example['class'] == 2].groupby('name').first()

Below is the pseudocode:

df_example.groupby('name').apply(lambda key, val: val['price'] > df_tem.loc[key]['price']).reset_index()

Is there any effective way to achieve something like filter dataframe based on a subset within groupby

result:

the smallest price with class=2 for each name group df_tem:

    class   price
name        
a   2   2
b   2   5

Therefore,

group a: price>2; group b: price>5

the output:

pd.DataFrame({'name': ['a', 'a', 'b'],
                          'class': [1, 2, 2],
                          'price': [3, 4, 6]})

Update:

actually i have an idea that create a new column called smallest, then filter

df_example by df_example['price'] > df_example['smallest ']. 

Do you know how to quickly create such column something like

df_example['smallest '] = df_example[df_example['class'] == 2].groupby('name')['price'].transform('first')

above way still have nan


Solution

  • Code

    Use groupby to aggregate the minimum and use map to map the result to name column for boolean indexing.

    m = df_example[df_example['class'] == 2].groupby('name')['price'].min()
    out = df_example[df_example['price'] > df_example['name'].map(m)]
    

    out

      name  class  price
    0    a      1      3
    1    a      2      4
    4    b      2      6
    

    update additional question

    Additionally, if you want to include name for which class=2 does not exist, use the code below

    m = df_example[df_example['class'] == 2].groupby('name')['price'].min()
    cond1 = df_example['price'] > df_example['name'].map(m)
    cond2 = ~df_example['name'].isin(m.index)
    out = df_example[cond1 | cond2]
    

    new example

    df_example = pd.DataFrame({'name': ['a', 'a', 'a', 'b', 'b', 'b', 'c'],
                              'class': [1, 2, 2, 3, 2, 2, 3],
                              'price': [3, 4, 2, 1, 6, 5, 5]})
    

    out:

      name  class  price
    0    a      1      3
    1    a      2      4
    4    b      2      6
    6    c      3      5