Search code examples
pythonpandasfilterslice

Conditional filtering of pandas data frame


I have a pandas data frame about football results. Each row of the dataframe represents a football match. The information of each match are:

Day | WinningTeamID | LosingTeamID | WinningPoints | LosingPoints | WinningFouls | ... | 
1          13             1              45                5               3  
1          12             4              21                12              4              

That is, the information are divided based on the game result: winning or losing. I would like to retrieve the data of each game for a specific team (e.g. 12).

Day | Points | Fouls | ... | 
1       21       4     ...
2       32       6     ...

The simplest way is to scan the whole dataframe, check if a specific teamID is on WinningID or LosingID and then, based on that, retrieve the "Losing-columns" or the "Winning-columns". Is there a more "elegant" way of slicing the pandas dataframe? This will simply give me the subset of matches where the team 12 is involved.

df[df[WinningTeamID == 12] | [LosingTeamID == 12]]

How can I filter those data and create the desired dataframe?


Solution

  • Suppose we could choose the format of the data. What would be ideal? Since we want to collect stats per TeamID, ideally we would have a column of TeamIDs and separate columns for each stat including the outcome.

    So the data would look like this:

    | Day | Outcome | TeamID | Points | Fouls |
    |   1 | Winning |     13 |     45 |     3 |
    |   1 | Losing  |      1 |      5 |   NaN |
    |   1 | Winning |     12 |     21 |     4 |
    |   1 | Losing  |      4 |     12 |   NaN |
    

    Here is how we can manipulate the given data into the desired form:

    import numpy as np
    import pandas as pd
    
    df = pd.DataFrame({'Day': [1, 1], 'LosingPoints': [5, 12], 'LosingTeamID': [1, 4], 'WinningFouls': [3, 4], 'WinningPoints': [45, 21], 'WinningTeamID': [13, 12]})
    df = df.set_index(['Day'])
    columns = df.columns.to_series().str.extract(r'^(Losing|Winning)?(.*)', expand=True)
    columns = pd.MultiIndex.from_arrays([columns[col] for col in columns], 
                                        names=['Outcome', None])
    df.columns = columns
    df = df.stack(level='Outcome').reset_index()
    print(df)
    

    yields

       Day  Outcome  Fouls  Points  TeamID
    0    1   Losing    NaN       5       1
    1    1  Winning    3.0      45      13
    2    1   Losing    NaN      12       4
    3    1  Winning    4.0      21      12
    

    Now we can obtain all the stats about TeamID 12 using

    print(df.loc[df['TeamID']==12])
    #    Day  Outcome  Fouls  Points  TeamID
    # 3    1  Winning    4.0      21      12
    

    df = df.set_index(['Day']) moves the Day column into the index.

    The purpose of placing Day in the index is to "protect" it from manipulations (primarily the stack call) that are intended only for columns labeled Losing or Winning. If you had other columns, such as Location or Officials which, like Day, do not pertain to Losing or Winning, then you'd want to include them in the set_index call too: e.g. df = df.set_index(['Day', 'Location', 'Officials']).

    Try commenting out df = df.set_index(['Day']) from the code above. Then step through the code line-by-line. In particular, compare what df.stack(level='Outcome') looks like with and without the set_index call:

    With df = df.set_index(['Day']):

    In [26]: df.stack(level='Outcome')
    Out[26]: 
                 Fouls  Points  TeamID
    Day Outcome                       
    1   Losing     NaN       5       1
        Winning    3.0      45      13
        Losing     NaN      12       4
        Winning    4.0      21      12
    

    Without df = df.set_index(['Day']):

    In [29]: df.stack(level='Outcome')
    Out[29]: 
               Day  Fouls  Points  TeamID
      Outcome                            
    0 NaN      1.0    3.0      45      13
      Losing   NaN    NaN       5       1
      Winning  1.0    3.0      45      13
    1 NaN      1.0    4.0      21      12
      Losing   NaN    NaN      12       4
      Winning  1.0    4.0      21      12
    

    Without the set_index call you end up with rows that you do not want -- the rows where Outcome equals NaN.


    The purpose of

    columns = df.columns.to_series().str.extract(r'^(Losing|Winning)?(.*)', expand=True)
    columns = pd.MultiIndex.from_arrays([columns[col] for col in columns], 
                                        names=['Outcome', None])
    

    is to create a multi-level column index (called a MultiIndex) which labels columns Losing or Winning as appropriate. Notice that by separating out the Losing or Winning parts of the labels, the remaining parts of the labels become duplicated.

    We end up with a DataFrame, df, with two columns labeled "Points" for example. This allows Pandas to identify these columns as somehow similar.

    The big gain -- the reason why we went through the trouble of setting up the MultiIndex is so that these "similar" columns can be "unified" by calling df.stack:

    In [47]: df
    Out[47]: 
    Outcome Losing        Winning              
            Points TeamID   Fouls Points TeamID
    Day                                        
    1            5      1       3     45     13
    1           12      4       4     21     12
    
    In [48]: df.stack(level="Outcome")
    Out[48]: 
                 Fouls  Points  TeamID
    Day Outcome                       
    1   Losing     NaN       5       1
        Winning    3.0      45      13
        Losing     NaN      12       4
        Winning    4.0      21      12
    

    stack, unstack, set_index and reset_index are the 4 fundamental DataFrame reshaping operations.

    • df.stack moves a level (or levels) of the column index into the row index.
    • df.unstack moves a level (or levels) of the row index into the column index.
    • df.set_index moves column values into the row index
    • df.reset_index moves a level (or levels) of the row index into a column of values

    Together, these 4 methods allow you to move data in your DataFrame anywhere you want -- in the columns, the row index or the column index.

    The above code is an example of how to use these tools (well, three of the four) to reshape data into a desired form.