python pandas dataframe sparse-matrix one-hot-encoding

Encoding a Dataframe differently than One-Hot

Suppose i have a df similar that registers the playable-character picks by each of the 6 players (3v3) in a computer game.

data = {'Pick_1_team1': ['A','A','A','B','C'],
        'Pick_2_team1': ['D','D','E','F','F'],
        'Pick_3_team1': ['G','G','A','M','O'],
        'Pick_1_team2': ['Q','Q','S','S','A'],
        'Pick_2_team2': ['V','W','X','A','B'],
        'Pick_3_team2': ['R','X','W','W','R']}

df = pd.DataFrame(data)
df_enc = pd.get_dummies(df)

When i use panda's get_dummies i will get a sparse df with the following structure (example, not the actual encoding from the example df)

Every Pick-slot has every possible pick (A-Z), for both teams.

| P1_T1_A | P1_T1_B | P1_T1_C | ... | P2_T1_A | ... | P3_T1_Z | ... | P1_T2_A | P1_T2_B | ... | P3_T2_Z |
|---------|---------|---------|-----|---------|-----|---------|-----|---------|---------|-----|---------|
| 0       | 0       | 1       | ... | 0       | ... | 0       | ... | 1       | 0       | ... | 1       |
| 1       | 0       | 0       | ... | 0       | ... | 0       | ... | 0       | 1       | ... | 1       |
| 1       | 0       | 0       | ... | 0       | ... | 0       | ... | 0       | 0       | ... | 0       |

Which is OK but if you consider a larger input space (with 150+ possible picks) the table grows very large. To make it more manageable i was thinking of somehow encoding it into the following format:

One column per category (character) x 2 for each team, 1 if that character was picked, 0 if not.

| T1_A | T1_B | T1_C | ... | T1_Z | ... | T2_A | T2_B | ... | T2_Z |
|------|------|------|-----|------|-----|------|------|-----|------|
| 0    | 0    | 1    | ... | 0    | ... | 1    | 1    | ... | 1    |
| 1    | 1    | 1    | ... | 0    | ... | 0    | 0    | ... | 1    |
| 1    | 0    | 0    | ... | 0    | ... | 0    | 0    | ... | 0    |

This would limit the feature numbers to number of picks x number of teams (26 letters * 2 teams) instead of possible picks x number of picks x number of teams (26 letters * 3 picks * 2 teams) Can panda do this via any built_in function? If not what would be the simplest way to go about this?

Any help appreciated! Thanks.

Solution

You can do it by first selecting one team with filter, stack the data and use str.get_dummies, then groupby level=0 (the rows in original df) and sum. add_prefix to the columns before concat for both teams like:

df_ = pd.concat([
            (df.filter(like=f'Pick_{i}').stack()
               .str.get_dummies()
               .groupby(level=0).sum()
               .add_prefix(f'T{i}_')
            ) for i in [1,2] ], 
            axis=1)
print (df_)
   T1_A  T1_B  T1_C  T1_D  T1_E  T1_F  T1_G  T1_M  T1_O  T2_A  T2_B  T2_Q  \
0     1     0     0     1     0     0     1     0     0     0     0     1   
1     1     0     0     1     0     0     1     0     0     0     0     1   
2     2     0     0     0     1     0     0     0     0     0     0     0   
3     0     1     0     0     0     1     0     1     0     1     0     0   
4     0     0     1     0     0     1     0     0     1     1     1     0   

   T2_R  T2_S  T2_V  T2_W  T2_X  
0     1     0     1     0     0  
1     0     0     0     1     1  
2     0     1     0     1     1  
3     0     1     0     1     0  
4     1     0     0     0     0