python pandas-groupby pivot-table one-hot-encoding dummy-variable

pivot long form categorical data by group and dummy code categorical variables

For the following dataframe, I am trying to pivot the categorical variable ('purchase_item') into wide format and dummy code them as 1/0 - based on whether or not a customer purchased it in each of the 4 quarters within 2016.

I would like to generate a pivotted dataframe as follows:

To get the desired result shown above, I have tried basically in various ways to combine groupby/pivot_table functions with a call to get_dummies() function in pandas. Example: data.groupby(["cust_id", "purchase_qtr"])["purchase_item"].reset_index().get_dummies()

However, none of my attempts have worked thus far.

Can somebody please help me generate the desired result?

Solution

One way of doing this is to get the crosstabulation, and then force all values > 1 to become 1, while keeping all 0's as they are:

TL;DR

out = (
    pd.crosstab([df["cust_id"], df["purchase_qtr"]], df["purchase_item"])
    .gt(0)
    .astype(int)
    .reset_index()
)

Breaking it all down:

Create Data

df = pd.DataFrame({
    "group1": np.repeat(["a", "b", "c"], 4),
    "group2": [1, 2, 3] * 4,
    "item": np.random.choice(["ab", "cd", "ef", "gh", "zx"], size=12)
})

print(df)
   group1  group2 item
0       a       1   cd
1       a       2   ef
2       a       3   gh
3       a       1   ef
4       b       2   zx
5       b       3   ab
6       b       1   ab
7       b       2   gh
8       c       3   gh
9       c       1   cd
10      c       2   ef
11      c       3   gh

Cross Tabulation

This returns a frequency table indicating how often each of the categories are observed together:

crosstab = pd.crosstab([df["group1"], df["group2"]], df["item"])

print(crosstab)
item           ab  cd  ef  gh  zx
group1 group2
a      1        0   1   1   0   0
       2        0   0   1   0   0
       3        0   0   0   1   0
b      1        1   0   0   0   0
       2        0   0   0   1   1
       3        1   0   0   0   0
c      1        0   1   0   0   0
       2        0   0   1   0   0
       3        0   0   0   2   0

Coerce Counts to Dummy Codes

Since we want to dummy code, and not count the co-occurance of categories, we can use a quick trick to force all values greater than 0 gt(0) to become 1 astype(int)

item           ab  cd  ef  gh  zx
group1 group2
a      1        0   1   1   0   0
       2        0   0   1   0   0
       3        0   0   0   1   0
b      1        1   0   0   0   0
       2        0   0   0   1   1
       3        1   0   0   0   0
c      1        0   1   0   0   0
       2        0   0   1   0   0
       3        0   0   0   1   0