Search code examples
pythonpandasdataframesparse-matrixone-hot-encoding

Encoding a Dataframe differently than One-Hot


Suppose i have a df similar that registers the playable-character picks by each of the 6 players (3v3) in a computer game.

data = {'Pick_1_team1': ['A','A','A','B','C'],
        'Pick_2_team1': ['D','D','E','F','F'],
        'Pick_3_team1': ['G','G','A','M','O'],
        'Pick_1_team2': ['Q','Q','S','S','A'],
        'Pick_2_team2': ['V','W','X','A','B'],
        'Pick_3_team2': ['R','X','W','W','R']}

df = pd.DataFrame(data)
df_enc = pd.get_dummies(df)

When i use panda's get_dummies i will get a sparse df with the following structure (example, not the actual encoding from the example df)

Every Pick-slot has every possible pick (A-Z), for both teams.

| P1_T1_A | P1_T1_B | P1_T1_C | ... | P2_T1_A | ... | P3_T1_Z | ... | P1_T2_A | P1_T2_B | ... | P3_T2_Z |
|---------|---------|---------|-----|---------|-----|---------|-----|---------|---------|-----|---------|
| 0       | 0       | 1       | ... | 0       | ... | 0       | ... | 1       | 0       | ... | 1       |
| 1       | 0       | 0       | ... | 0       | ... | 0       | ... | 0       | 1       | ... | 1       |
| 1       | 0       | 0       | ... | 0       | ... | 0       | ... | 0       | 0       | ... | 0       |

Which is OK but if you consider a larger input space (with 150+ possible picks) the table grows very large. To make it more manageable i was thinking of somehow encoding it into the following format:

One column per category (character) x 2 for each team, 1 if that character was picked, 0 if not.

| T1_A | T1_B | T1_C | ... | T1_Z | ... | T2_A | T2_B | ... | T2_Z |
|------|------|------|-----|------|-----|------|------|-----|------|
| 0    | 0    | 1    | ... | 0    | ... | 1    | 1    | ... | 1    |
| 1    | 1    | 1    | ... | 0    | ... | 0    | 0    | ... | 1    |
| 1    | 0    | 0    | ... | 0    | ... | 0    | 0    | ... | 0    |

This would limit the feature numbers to number of picks x number of teams (26 letters * 2 teams) instead of possible picks x number of picks x number of teams (26 letters * 3 picks * 2 teams) Can panda do this via any built_in function? If not what would be the simplest way to go about this?

Any help appreciated! Thanks.


Solution

  • You can do it by first selecting one team with filter, stack the data and use str.get_dummies, then groupby level=0 (the rows in original df) and sum. add_prefix to the columns before concat for both teams like:

    df_ = pd.concat([
                (df.filter(like=f'Pick_{i}').stack()
                   .str.get_dummies()
                   .groupby(level=0).sum()
                   .add_prefix(f'T{i}_')
                ) for i in [1,2] ], 
                axis=1)
    print (df_)
       T1_A  T1_B  T1_C  T1_D  T1_E  T1_F  T1_G  T1_M  T1_O  T2_A  T2_B  T2_Q  \
    0     1     0     0     1     0     0     1     0     0     0     0     1   
    1     1     0     0     1     0     0     1     0     0     0     0     1   
    2     2     0     0     0     1     0     0     0     0     0     0     0   
    3     0     1     0     0     0     1     0     1     0     1     0     0   
    4     0     0     1     0     0     1     0     0     1     1     1     0   
    
       T2_R  T2_S  T2_V  T2_W  T2_X  
    0     1     0     1     0     0  
    1     0     0     0     1     1  
    2     0     1     0     1     1  
    3     0     1     0     1     0  
    4     1     0     0     0     0