Suppose i have a df similar that registers the playable-character picks by each of the 6 players (3v3) in a computer game.
data = {'Pick_1_team1': ['A','A','A','B','C'],
'Pick_2_team1': ['D','D','E','F','F'],
'Pick_3_team1': ['G','G','A','M','O'],
'Pick_1_team2': ['Q','Q','S','S','A'],
'Pick_2_team2': ['V','W','X','A','B'],
'Pick_3_team2': ['R','X','W','W','R']}
df = pd.DataFrame(data)
df_enc = pd.get_dummies(df)
When i use panda's get_dummies i will get a sparse df with the following structure (example, not the actual encoding from the example df)
Every Pick-slot has every possible pick (A-Z), for both teams.
| P1_T1_A | P1_T1_B | P1_T1_C | ... | P2_T1_A | ... | P3_T1_Z | ... | P1_T2_A | P1_T2_B | ... | P3_T2_Z |
|---------|---------|---------|-----|---------|-----|---------|-----|---------|---------|-----|---------|
| 0 | 0 | 1 | ... | 0 | ... | 0 | ... | 1 | 0 | ... | 1 |
| 1 | 0 | 0 | ... | 0 | ... | 0 | ... | 0 | 1 | ... | 1 |
| 1 | 0 | 0 | ... | 0 | ... | 0 | ... | 0 | 0 | ... | 0 |
Which is OK but if you consider a larger input space (with 150+ possible picks) the table grows very large. To make it more manageable i was thinking of somehow encoding it into the following format:
One column per category (character) x 2 for each team, 1 if that character was picked, 0 if not.
| T1_A | T1_B | T1_C | ... | T1_Z | ... | T2_A | T2_B | ... | T2_Z |
|------|------|------|-----|------|-----|------|------|-----|------|
| 0 | 0 | 1 | ... | 0 | ... | 1 | 1 | ... | 1 |
| 1 | 1 | 1 | ... | 0 | ... | 0 | 0 | ... | 1 |
| 1 | 0 | 0 | ... | 0 | ... | 0 | 0 | ... | 0 |
This would limit the feature numbers to number of picks x number of teams (26 letters * 2 teams) instead of possible picks x number of picks x number of teams (26 letters * 3 picks * 2 teams) Can panda do this via any built_in function? If not what would be the simplest way to go about this?
Any help appreciated! Thanks.
You can do it by first selecting one team with filter
, stack
the data and use str.get_dummies
, then groupby
level=0 (the rows in original df) and sum
. add_prefix
to the columns before concat
for both teams like:
df_ = pd.concat([
(df.filter(like=f'Pick_{i}').stack()
.str.get_dummies()
.groupby(level=0).sum()
.add_prefix(f'T{i}_')
) for i in [1,2] ],
axis=1)
print (df_)
T1_A T1_B T1_C T1_D T1_E T1_F T1_G T1_M T1_O T2_A T2_B T2_Q \
0 1 0 0 1 0 0 1 0 0 0 0 1
1 1 0 0 1 0 0 1 0 0 0 0 1
2 2 0 0 0 1 0 0 0 0 0 0 0
3 0 1 0 0 0 1 0 1 0 1 0 0
4 0 0 1 0 0 1 0 0 1 1 1 0
T2_R T2_S T2_V T2_W T2_X
0 1 0 1 0 0
1 0 0 0 1 1
2 0 1 0 1 1
3 0 1 0 1 0
4 1 0 0 0 0