I have a data frame with a row of data like this:
play_by_play = pd.DataFrame([{
"players": "Tom Brady; Mike Evans; Tristan Wirfs; Leonard Fournette; Chris Godwin",
"down": 1,
"to_go": 10,
"play_type": 'pass',
"yards_gained": 8,
"pass_attempt": 1,
"complete_pass": 1,
"rush_attempt": 0
}])
I want to keep track of stats while the given players are on the field by using group by and aggregate functions. If I were looking to do this player-by-player, I would use play_by_play["players"].str.contains("Tom Brady")
and aggregating the data using that filter, but I am looking to automate this. The solution I've landed on is to duplicate rows and have the "players" value split for each row. It would look something like this:
player | down | to_go | play_type | yards_gained | pass_attempt | complete_pass | rush_attempt |
---|---|---|---|---|---|---|---|
"Tom Brady" | 1 | 10 | pass | 8 | 1 | 1 | 0 |
"Mike Evans" | 1 | 10 | pass | 8 | 1 | 1 | 0 |
"Tristan Wirfs" | 1 | 10 | pass | 8 | 1 | 1 | 0 |
"Leonard Fournette" | 1 | 10 | pass | 8 | 1 | 1 | 0 |
"Chris Godwin" | 1 | 10 | pass | 8 | 1 | 1 | 0 |
How could I accomplish this? As I mentioned before, this needs to be pretty scalable, as there are thousands of rows of data. If there's an easier way to group by based on unique values contained in a semicolon-separated column, I'm more than happy to go that route.
You can use pandas.Series.str.split
to make a list of the players then pandas.DataFrame.explode
to make multiple rows :
play_by_play['players'] = play_by_play['players'].str.split(';')
play_by_play = play_by_play.explode('players').reset_index(drop=True)
print(play_by_play)
players down to_go play_type yards_gained pass_attempt complete_pass rush_attempt
0 Tom Brady 1 10 pass 8 1 1 0
1 Mike Evans 1 10 pass 8 1 1 0
2 Tristan Wirfs 1 10 pass 8 1 1 0
3 Leonard Fournette 1 10 pass 8 1 1 0
4 Chris Godwin 1 10 pass 8 1 1 0