pandas dataframe one-hot-encoding feature-engineering

Aggregate features row-wise in dataframe


I am trying to create features from a sample that looks like this:

index user product sub_product status
0 u1 p1 sp1 NA
1 u1 p1 sp2 NA
2 u1 p1 sp3 CANCELED
3 u1 p1 sp4 AVAIL
4 u2 p3 sp2 AVAIL
5 u2 p3 sp3 CANCELED
6 u2 p3 sp7 NA

First, I created dummies:

pd.get_dummies(x, columns=['product', 'sub_product', 'status'])

But I also need to aggregate the rows so that there is one row per user. What is the best way to do it?
If I just group it:

pd.get_dummies(x, columns=['product', 'sub_product', 'status']).groupby('user').max()
user product_p1 product_p3 sub_product_sp1 sub_product_sp2 sub_product_sp3 sub_product_sp4 sub_product_sp7 status_AVAIL status_CANCELED status_NA
u1 1 0 1 1 1 1 0 1 1 1
u2 0 1 0 1 1 0 1 1 1 1

I will lose information, for example that for u1 the status of sp3 is CANCELED. So it looks like I have to create dummies for every column combination?
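
A minimal sketch of that "dummies per combination" idea, assuming the NA entries are the literal string "NA" (real missing values would need something like fillna("NA") first). The names combo and features are illustrative only: the three columns are joined into one key per row, and the dummies are built on that key before grouping.

```python
import pandas as pd

# Sample data from the question; "NA" is assumed to be a plain string here.
x = pd.DataFrame({
    "user": ["u1", "u1", "u1", "u1", "u2", "u2", "u2"],
    "product": ["p1", "p1", "p1", "p1", "p3", "p3", "p3"],
    "sub_product": ["sp1", "sp2", "sp3", "sp4", "sp2", "sp3", "sp7"],
    "status": ["NA", "NA", "CANCELED", "AVAIL", "AVAIL", "CANCELED", "NA"],
})

# One combined key per row, e.g. "p1_sp3_CANCELED".
combo = x["product"] + "_" + x["sub_product"] + "_" + x["status"]

# One dummy column per observed combination, then one row per user.
features = pd.get_dummies(combo, dtype=int).groupby(x["user"]).max()
print(features)
```

Only combinations that actually occur in the data get a column, so the link between sp3 and CANCELED for u1 is kept.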


Solution

  • Update: You are basically looking for a pivot (df below is the dataframe called x in the question):

    out = (df.astype(str)
       .assign(value=1)
       .pivot_table(index=['user'], columns=['product','sub_product','status'],
                    values='value', fill_value=0, aggfunc='max')
    )
    
    out.columns = ['_'.join(x) for x in out.columns]
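
For reference, here is a self-contained sketch of the same pivot run on the sample from the question. It assumes the NA entries are the literal string "NA", so the astype(str) step is left out; if they are real missing values, they would need to be converted or filled first, since pivot_table skips missing keys by default.

```python
import pandas as pd

# Sample data from the question; "NA" is assumed to be a plain string here.
df = pd.DataFrame({
    "user": ["u1", "u1", "u1", "u1", "u2", "u2", "u2"],
    "product": ["p1", "p1", "p1", "p1", "p3", "p3", "p3"],
    "sub_product": ["sp1", "sp2", "sp3", "sp4", "sp2", "sp3", "sp7"],
    "status": ["NA", "NA", "CANCELED", "AVAIL", "AVAIL", "CANCELED", "NA"],
})

out = (
    df.assign(value=1)  # helper column marking "this combination occurs"
      .pivot_table(index="user",
                   columns=["product", "sub_product", "status"],
                   values="value", fill_value=0, aggfunc="max")
)

# Flatten the three-level column index into names like "p1_sp3_CANCELED".
out.columns = ["_".join(col) for col in out.columns]
print(out)
```

The result has one row per user and one column per observed (product, sub_product, status) combination, with 1 where the user has that combination and 0 otherwise.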