Search code examples
pythonpandaspandas-groupbycontains

Python pandas - Dropping grouped rows based on missing territory code


In the df below, we have two "Mixes" as indicated by the Mix_Name and Mix_ID columns. And within each of these mixes are multiple tracks with unique Track_ID's that contain different territories (see Territories column).

Mix_Name    Mix_ID  Territories     Track_Title  Artist         Track_ID
Mix Vol 1   abc123  GB | US | MX    Cabala       Prok & Fitch   12345678
Mix Vol 1   abc123  US | CA         Autosave     Fedde          9101112
Mix Vol 1   abc123  US | CA         4 Life       Firebeatz      4151617
Mix Vol 1   abc123  IE | MX         4 Life       Firebeatz      18192021
Mix Vol 2   xyz456  CA              Warrior      My Nu Leng     22232425
Mix Vol 2   xyz456  FR | DE | NE    Warrior      My Nu Leng     27282930
Mix Vol 2   xyz456  US | CA         I'm Madonna  Madonna        31323334
Mix Vol 2   xyz456  CA | US | FR    Trouble      Gregor Salto   35363738

What I'm hoping to do here find out which tracks do not have the US territory, and if any of the tracks do not have the US territory, I will want to drop the entire mix from my dataframe. With the resulting dataframe looking like this because "Mix Vol 2" is missing the US territory in one of its tracks:

Mix_Name    Mix_ID  Territories     Track_Title  Artist         Track_ID
Mix Vol 1   abc123  GB | US | MX    Cabala       Prok & Fitch   12345678
Mix Vol 1   abc123  US | CA         Autosave     Fedde          9101112
Mix Vol 1   abc123  US | CA         4 Life       Firebeatz      4151617
Mix Vol 1   abc123  IE | MX         4 Life       Firebeatz      18192021

I know that I need to Groupby: 'Mix_ID', 'Track_Title', and 'Artist' but I'm unsure of how to search the territories column to see if it contains the "US" territory. Any help would be much appreciated!


Solution

  • df.groupby(['Mix_Name', 'Track', 'Artist']).filter(lambda x: (x['Territories'].str.contains('US').any()))
    

    You do your groupby, then filter the groups to check if the territories column contains 'US'.

    EDIT

    notInUS = df.groupby(['Mix_Name','Track_Title','Artist']).filter(lambda x: ~x['Territories'].str.contains('US').any())['Mix_Name'].unique()
    df[~df['Mix_Name'].isin(notInUS)]
    

    We can look for the unique Mix_Name's that doesn't have a track in the US, then filter it out of our main df.