Search code examples
pythonpandasdataframefiltergroup-by

filter result of groupby with pd.Series of boolean


Consider having two dataframes having the same column a. However in the first dataframe column a has unique values, whereas in the second one it does not although the possible values of column are the same in both dataframes and both of them are sorted on a.

df = pd.DataFrame([{'a': 1, 'b':2}, {'a':2, 'b': 4}, {'a':3, 'b': 4}])
df2 = pd.DataFrame([{'a': 1, 'c':2}, {'a': 1, 'c':3}, {'a':2, 'c': 4}, {'a': 2, 'c':5}, {'a': 3, 'c':5}, {'a': 3, 'c':5}])

From the first dataframe I want to consider only a given set of rows, based on column b. For example:

my_indexes = df1[df1['b']==4]

the result my_indexes is a pd.Series of booleans (not exactly indexes but let me keep this naming), and normally in a dataframe it can be used as df[my_indexes] to extract some rows. I want to use it on df2 to extract the rows with values a that corresponds to that for which 'b'==4, i.e., the ones for which valid entries are True. However, df2 has more rows than df. For this reason I first group on a as:

grouped = df2.groupby("a")

in this way grouped will have the same length of my_indexes, since all values of column a in df are also in df2 and since both are sorted on a. So I thought I could use valid directly on groupded as it were a df, but grouped[my_indexes] does not work.

How can I use my_indexes to filter df2?


Solution

  • First filter df2 based on values in df['a'] where the condition df['a'] == 1 is satisfied and then perform the groupby operation on the filtered df. Basically:

    import pandas as pd
    
    df = pd.DataFrame([{'a': 1, 'b': 2}, {'a': 2, 'b': 4}])
    df2 = pd.DataFrame([{'a': 1, 'c': 2}, {'a': 1, 'c': 3}, {'a': 2, 'c': 4}, {'a': 2, 'c': 5}])
    
    values_of_interest = df[df['a'] == 1]['a']
    
    filtered_df2 = df2[df2['a'].isin(values_of_interest)]
    grouped_df = filtered_df2.groupby('a')
    

    AS an example:

    
    for name, group in grouped_df:
        print(f'Group: {name}')
        print(group)
    

    Which gives

    Group: 1
       a  c
    0  1  2
    1  1  3