I am trying to create features from a sample that looks like this:
index | user | product | sub_product | status |
---|---|---|---|---|
0 | u1 | p1 | sp1 | NA |
1 | u1 | p1 | sp2 | NA |
2 | u1 | p1 | sp3 | CANCELED |
3 | u1 | p1 | sp4 | AVAIL |
4 | u2 | p3 | sp2 | AVAIL |
5 | u2 | p3 | sp3 | CANCELED |
6 | u2 | p3 | sp7 | NA |
First, I created dummies:
pd.get_dummies(x, columns=['product', 'sub_product', 'status'])
But I also need to group the rows so that there is one row per user. What is the best way to do it?
If I just group it:
pd.get_dummies(x, columns=['product', 'sub_product', 'status']).groupby('user').max()
user | product_p1 | product_p3 | sub_product_sp1 | sub_product_sp2 | sub_product_sp3 | sub_product_sp4 | sub_product_sp7 | status_AVAIL | status_CANCELED | status_NA |
---|---|---|---|---|---|---|---|---|---|---|
u1 | 1 | 0 | 1 | 1 | 1 | 1 | 0 | 1 | 1 | 1 |
u2 | 0 | 1 | 0 | 1 | 1 | 0 | 1 | 1 | 1 | 1 |
I will lose information, e.g. that for u1 the status of sp3 is CANCELED. So it looks like I have to create dummies for every column combination?
Update: You are basically looking for a pivot:
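# df is the original frame (called x in the question)
out = (df.astype(str)
         .assign(value=1)   # marker column: 1 wherever a combination occurs
         .pivot_table(index=['user'], columns=['product', 'sub_product', 'status'],
                      values='value', fill_value=0, aggfunc='max')
      )
# flatten the 3-level column MultiIndex into names like 'p1_sp3_CANCELED'
out.columns = ['_'.join(x) for x in out.columns]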
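
For reference, here is a self-contained sketch of the pivot above run on the sample from the question. It assumes `NA` is stored as a plain string rather than a missing value. The `crosstab` variant at the end is not part of the original answer, just one alternative that should give the same table by joining the three columns into a single key first (`key` and `alt` are illustrative names).

```python
import pandas as pd

# Sample data copied from the question; "NA" is assumed to be a literal string.
df = pd.DataFrame({
    "user": ["u1", "u1", "u1", "u1", "u2", "u2", "u2"],
    "product": ["p1", "p1", "p1", "p1", "p3", "p3", "p3"],
    "sub_product": ["sp1", "sp2", "sp3", "sp4", "sp2", "sp3", "sp7"],
    "status": ["NA", "NA", "CANCELED", "AVAIL", "AVAIL", "CANCELED", "NA"],
})

# One row per user, one column per (product, sub_product, status) combination.
out = (
    df.astype(str)
      .assign(value=1)
      .pivot_table(index="user",
                   columns=["product", "sub_product", "status"],
                   values="value", fill_value=0, aggfunc="max")
)
out.columns = ["_".join(col) for col in out.columns]
print(out)

# Alternative sketch: build a combined key, count occurrences per user,
# then cap the counts at 1 so the result is a 0/1 indicator table.
key = df["product"] + "_" + df["sub_product"] + "_" + df["status"]
alt = pd.crosstab(df["user"], key).clip(upper=1)
print(alt)
```

Each column name now carries the full combination, so a column such as `p1_sp3_CANCELED` keeps the fact that sp3 was canceled for u1, which the plain `groupby('user').max()` on separate dummies cannot express.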