Search code examples
pythonpandasdataframeencodingone-hot-encoding

Binary Vectorization Encoding for categorical variable grouped by date issue



I'm having an issue trying to vectorize this in some kind of binary encoding but aggregated when there is more than one row (as the variations of the categorical variable are non-exclusive), yet avoiding merging it with other dates. (python and pandas)

Let's say this is the data

id1 id2 type month.measure
105 50 growing 04-2020
105 50 advancing 04-2020
44 29 advancing 04-2020
105 50 retreating 05-2020
105 50 shrinking 05-2020

It would have to end like this

id1 id2 growing shrinking advancing retreating month.measure
105 50 1 0 1 0 04-2020
44 29 0 0 1 0 04-2020
105 50 0 1 0 1 05-2020

I've been trying with transformations of all kinds, lambda functions, pandas get_dummies and trying to aggregate them grouped by the 2 ids and the date but I couldn't find a way.

Hope we can sort it out! Thanks in advance! :)


Solution

  • This solution uses pandas get_dummies to one-hot encode the "TYPE" column, then concatenates the one-hot encoded dataframe back with the original, followed by a groupby applied to the ID columns and "MONTH":

    # Set up the dataframe
    ID1 = [105,105,44,105,105]
    ID2 = [50,50,29,50,50]
    TYPE = ['growing','advancing','advancing','retreating','shrinking']
    MONTH = ['04-2020','04-2020','04-2020','05-2020','05-2020']
    
    df = pd.DataFrame({'ID1':ID1,'ID2':ID2, 'TYPE':TYPE, 'MONTH.MEASURE':MONTH})
    
    # Apply get_dummies and groupby operations
    df = pd.concat([df.drop('TYPE',axis=1),pd.get_dummies(df['TYPE'])],axis=1)\
           .groupby(['ID1','ID2','MONTH.MEASURE']).sum().reset_index()
    
    # These bits are just cosmetic to get the output to look more like your required output
    df.columns = [c.upper() for c in df.columns]
    
    col_order = ['GROWING','SHRINKING','ADVANCING','RETREATING','MONTH.MEASURE']
    
    df[['ID1','ID2']+col_order]
    
    #    ID1  ID2  GROWING  SHRINKING  ADVANCING  RETREATING MONTH.MEASURE
    # 0   44   29        0          0          1           0       04-2020
    # 1  105   50        1          0          1           0       04-2020
    # 2  105   50        0          1          0           1       05-2020