Search code examples
pythonpandasgroup-byaggregate

Split Column of Semicolon-Separated Values and Duplicate Row with each Value in Pandas


I have a data frame with a row of data like this:

play_by_play = pd.DataFrame([{
    "players": "Tom Brady; Mike Evans; Tristan Wirfs; Leonard Fournette; Chris Godwin", 
               "down": 1, 
               "to_go": 10,
                "play_type": 'pass',
               "yards_gained": 8,
               "pass_attempt": 1,
               "complete_pass": 1,
               "rush_attempt": 0
              }])

I want to keep track of stats while the given players are on the field by using group by and aggregate functions. If I were looking to do this player-by-player, I would use play_by_play["players"].str.contains("Tom Brady") and aggregating the data using that filter, but I am looking to automate this. The solution I've landed on is to duplicate rows and have the "players" value split for each row. It would look something like this:

player down to_go play_type yards_gained pass_attempt complete_pass rush_attempt
"Tom Brady" 1 10 pass 8 1 1 0
"Mike Evans" 1 10 pass 8 1 1 0
"Tristan Wirfs" 1 10 pass 8 1 1 0
"Leonard Fournette" 1 10 pass 8 1 1 0
"Chris Godwin" 1 10 pass 8 1 1 0

How could I accomplish this? As I mentioned before, this needs to be pretty scalable, as there are thousands of rows of data. If there's an easier way to group by based on unique values contained in a semicolon-separated column, I'm more than happy to go that route.


Solution

  • You can use pandas.Series.str.split to make a list of the players then pandas.DataFrame.explode to make multiple rows :

    play_by_play['players'] = play_by_play['players'].str.split(';')
    play_by_play = play_by_play.explode('players').reset_index(drop=True)
    

    # Output :

    print(play_by_play)
    
                  players  down  to_go play_type  yards_gained  pass_attempt  complete_pass  rush_attempt
    0           Tom Brady     1     10      pass             8             1              1             0
    1          Mike Evans     1     10      pass             8             1              1             0
    2       Tristan Wirfs     1     10      pass             8             1              1             0
    3   Leonard Fournette     1     10      pass             8             1              1             0
    4        Chris Godwin     1     10      pass             8             1              1             0