Search code examples
pythonpandasmulti-index

Python Pandas Multiindexing select rows that match all values in a list


Consider the following data frame.

import Pandas as pd
df = pd.Dataframe
df = pd.DataFrame()
df['Folder'] = [2,3,4,5  ,2,4,5, 2,3,4, 2,3,4,5,1]
df['Country'] = ['USA','USA','USA','USA'  ,'Mexico','Mexico','Mexico', 'UK','UK','UK', 'Canada','Canada','Canada','Canada','Canada']
df['Data'] = [20,30,43,15  ,25,44,15, 26,37,47, 24,34,47,55,18]
df.set_index(['Country','Folder'], drop=True, inplace=True)
df


                Data
Country Folder      
USA     2         20
        3         30
        4         43
        5         15
Mexico  2         25
        4         44
        5         15
UK      2         26
        3         37
        4         47
Canada  2         24
        3         34
        4         47
        5         55
        1         18

How do I collect the rows where Folder has all of lst=[1,3,4] in level Folder?

                Data
Country Folder      

Canada  2         24
        3         34
        4         47
        5         55
        1         18

OR

                Data
Country Folder      
Canada  3         34
        4         47
        1         18

Either would work for me. I want to know that Canada matches all of lst. lst may be up to 8 items long.

I have tried df.query("Folder in @lst") however that returns rows matching any of lst. I need matching All of lst.

Thanks in advance for any help.


Solution

  • This is an alternative to @jezrael's approach, where we group on the boolean values from isin and country:

    In [38]: (df.groupby([df.index.isin([1,3,4], level='Folder'),
                          df.index.get_level_values('Country')])
                .filter(lambda x: len(x)==3)
               )
    Out[38]: 
                    Data
    Country Folder      
    Canada  3         34
            4         47
            1         18
    

    Take advantage of the fact that you have three numbers in the list, so if it matches all, then it should be 3.

    To get all values, you could chunk the steps:

    mapping = df.index.isin([1,3,4], level = 'Folder')
    
    filtered = (pd.Series(mapping)
                .groupby(df.index.get_level_values('Country'))
                .transform(lambda x: sum(x)>=3)
                )
    
    In [61]: df.loc[filtered.array]
    Out[61]: 
                    Data
    Country Folder      
    Canada  2         24
            3         34
            4         47
            5         55
            1         18