python pandas dataframe encoding one-hot-encoding

Binary Vectorization Encoding for categorical variable grouped by date issue

I'm having an issue trying to vectorize this in some kind of binary encoding but aggregated when there is more than one row (as the variations of the categorical variable are non-exclusive), yet avoiding merging it with other dates. (python and pandas)

Let's say this is the data

id1	id2	type	month.measure
105	50	growing	04-2020
105	50	advancing	04-2020
44	29	advancing	04-2020
105	50	retreating	05-2020
105	50	shrinking	05-2020

It would have to end like this

id1	id2	growing	shrinking	advancing	retreating	month.measure
105	50	1	0	1	0	04-2020
44	29	0	0	1	0	04-2020
105	50	0	1	0	1	05-2020

I've been trying with transformations of all kinds, lambda functions, pandas get_dummies and trying to aggregate them grouped by the 2 ids and the date but I couldn't find a way.

Hope we can sort it out! Thanks in advance! :)

Solution

This solution uses pandas get_dummies to one-hot encode the "TYPE" column, then concatenates the one-hot encoded dataframe back with the original, followed by a groupby applied to the ID columns and "MONTH":

# Set up the dataframe
ID1 = [105,105,44,105,105]
ID2 = [50,50,29,50,50]
TYPE = ['growing','advancing','advancing','retreating','shrinking']
MONTH = ['04-2020','04-2020','04-2020','05-2020','05-2020']

df = pd.DataFrame({'ID1':ID1,'ID2':ID2, 'TYPE':TYPE, 'MONTH.MEASURE':MONTH})

# Apply get_dummies and groupby operations
df = pd.concat([df.drop('TYPE',axis=1),pd.get_dummies(df['TYPE'])],axis=1)\
       .groupby(['ID1','ID2','MONTH.MEASURE']).sum().reset_index()

# These bits are just cosmetic to get the output to look more like your required output
df.columns = [c.upper() for c in df.columns]

col_order = ['GROWING','SHRINKING','ADVANCING','RETREATING','MONTH.MEASURE']

df[['ID1','ID2']+col_order]

#    ID1  ID2  GROWING  SHRINKING  ADVANCING  RETREATING MONTH.MEASURE
# 0   44   29        0          0          1           0       04-2020
# 1  105   50        1          0          1           0       04-2020
# 2  105   50        0          1          0           1       05-2020