For the following dataframe, I am trying to pivot the categorical variable ('purchase_item') into wide format and dummy code them as 1/0 - based on whether or not a customer purchased it in each of the 4 quarters within 2016.
I would like to generate a pivotted dataframe as follows:
To get the desired result shown above, I have tried basically in various ways to combine groupby/pivot_table functions with a call to get_dummies() function in pandas. Example:
data.groupby(["cust_id", "purchase_qtr"])["purchase_item"].reset_index().get_dummies()
However, none of my attempts have worked thus far.
Can somebody please help me generate the desired result?
One way of doing this is to get the crosstabulation, and then force all values > 1 to become 1, while keeping all 0's as they are:
out = (
pd.crosstab([df["cust_id"], df["purchase_qtr"]], df["purchase_item"])
.gt(0)
.astype(int)
.reset_index()
)
Breaking it all down:
df = pd.DataFrame({
"group1": np.repeat(["a", "b", "c"], 4),
"group2": [1, 2, 3] * 4,
"item": np.random.choice(["ab", "cd", "ef", "gh", "zx"], size=12)
})
print(df)
group1 group2 item
0 a 1 cd
1 a 2 ef
2 a 3 gh
3 a 1 ef
4 b 2 zx
5 b 3 ab
6 b 1 ab
7 b 2 gh
8 c 3 gh
9 c 1 cd
10 c 2 ef
11 c 3 gh
This returns a frequency table indicating how often each of the categories are observed together:
crosstab = pd.crosstab([df["group1"], df["group2"]], df["item"])
print(crosstab)
item ab cd ef gh zx
group1 group2
a 1 0 1 1 0 0
2 0 0 1 0 0
3 0 0 0 1 0
b 1 1 0 0 0 0
2 0 0 0 1 1
3 1 0 0 0 0
c 1 0 1 0 0 0
2 0 0 1 0 0
3 0 0 0 2 0
Since we want to dummy code, and not count the co-occurance of categories, we can use a quick trick to force all values greater than 0 gt(0)
to become 1 astype(int)
item ab cd ef gh zx
group1 group2
a 1 0 1 1 0 0
2 0 0 1 0 0
3 0 0 0 1 0
b 1 1 0 0 0 0
2 0 0 0 1 1
3 1 0 0 0 0
c 1 0 1 0 0 0
2 0 0 1 0 0
3 0 0 0 1 0