Search code examples
pythonpandas-groupbypivot-tableone-hot-encodingdummy-variable

pivot long form categorical data by group and dummy code categorical variables


For the following dataframe, I am trying to pivot the categorical variable ('purchase_item') into wide format and dummy code them as 1/0 - based on whether or not a customer purchased it in each of the 4 quarters within 2016.

enter image description here

I would like to generate a pivotted dataframe as follows: enter image description here

To get the desired result shown above, I have tried basically in various ways to combine groupby/pivot_table functions with a call to get_dummies() function in pandas. Example: data.groupby(["cust_id", "purchase_qtr"])["purchase_item"].reset_index().get_dummies()

However, none of my attempts have worked thus far.

Can somebody please help me generate the desired result?


Solution

  • One way of doing this is to get the crosstabulation, and then force all values > 1 to become 1, while keeping all 0's as they are:

    TL;DR

    out = (
        pd.crosstab([df["cust_id"], df["purchase_qtr"]], df["purchase_item"])
        .gt(0)
        .astype(int)
        .reset_index()
    )
    

    Breaking it all down:

    Create Data

    df = pd.DataFrame({
        "group1": np.repeat(["a", "b", "c"], 4),
        "group2": [1, 2, 3] * 4,
        "item": np.random.choice(["ab", "cd", "ef", "gh", "zx"], size=12)
    })
    
    print(df)
       group1  group2 item
    0       a       1   cd
    1       a       2   ef
    2       a       3   gh
    3       a       1   ef
    4       b       2   zx
    5       b       3   ab
    6       b       1   ab
    7       b       2   gh
    8       c       3   gh
    9       c       1   cd
    10      c       2   ef
    11      c       3   gh
    

    Cross Tabulation

    This returns a frequency table indicating how often each of the categories are observed together:

    crosstab = pd.crosstab([df["group1"], df["group2"]], df["item"])
    
    print(crosstab)
    item           ab  cd  ef  gh  zx
    group1 group2
    a      1        0   1   1   0   0
           2        0   0   1   0   0
           3        0   0   0   1   0
    b      1        1   0   0   0   0
           2        0   0   0   1   1
           3        1   0   0   0   0
    c      1        0   1   0   0   0
           2        0   0   1   0   0
           3        0   0   0   2   0
    

    Coerce Counts to Dummy Codes

    Since we want to dummy code, and not count the co-occurance of categories, we can use a quick trick to force all values greater than 0 gt(0) to become 1 astype(int)

    item           ab  cd  ef  gh  zx
    group1 group2
    a      1        0   1   1   0   0
           2        0   0   1   0   0
           3        0   0   0   1   0
    b      1        1   0   0   0   0
           2        0   0   0   1   1
           3        1   0   0   0   0
    c      1        0   1   0   0   0
           2        0   0   1   0   0
           3        0   0   0   1   0