Combining Two Pandas Categorical Columns into One Column


I have a Pandas DataFrame that has two categorical columns:

df = pd.DataFrame({"x": ["A", "A", "B", "B", "B", "C", "C"],
                   "y": ["L", "L", "M", "M", "M", "N", "M"],
                  }).astype("category")

   x  y
0  A  L
1  A  L
2  B  M
3  B  M
4  B  M
5  C  N
6  C  M

I would like to combine the two columns and store the result as a new categorical column, with the two values separated by " - ". One naive way of doing this is to convert the columns to strings:

df.assign(z=df.x.astype(str) + " - " + df.y.astype(str))

   x  y      z
0  A  L  A - L
1  A  L  A - L
2  B  M  B - M
3  B  M  B - M
4  B  M  B - M
5  C  N  C - N
6  C  M  C - M

This works for a small toy example, but I need z to be of category dtype (not string). My real data has around 500K rows, and x and y hold strings that may be 50-100 characters long, with 88903 and 39132 categories, respectively. Converting these columns to strings first causes memory usage to explode.

Is there a more efficient way to get a categorical output without using a ton of memory and taking too long?
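
To make the memory concern concrete, here is a minimal sketch that compares a categorical column with the same data held as one Python string per row. The sizes (10 labels, 1,000 rows) are made up for illustration and are not the real data.

```python
import pandas as pd

# Hypothetical data: 1,000 rows drawn from 10 long labels (illustrative sizes only)
labels = [f"label-{i:02d}-" + "x" * 60 for i in range(10)]
s_cat = pd.Series(labels * 100, dtype="category")

# Same values stored as plain Python strings, one object per row
s_obj = s_cat.astype(object)

# deep=True counts the bytes of the string objects themselves
cat_bytes = s_cat.memory_usage(deep=True)
obj_bytes = s_obj.memory_usage(deep=True)
print(f"category: {cat_bytes} bytes, object: {obj_bytes} bytes")
```

The categorical version stores each distinct string once plus a small integer code per row, which is why the per-row string conversion is so much more expensive.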


Solution

  • You can try building the combined label once per pair of categories, then merging it back onto the original data:

    import pandas as pd
    from itertools import product
    
    # original data
    df = pd.DataFrame({"x": ["A", "A", "B", "B", "B", "C", "C"],
                       "y": ["L", "L", "M", "M", "M", "N", "M"],
                      }).astype("category")
    
    # extract unique categories
    c1 = df.x.cat.categories
    c2 = df.y.cat.categories
    
    # make data frame with all possible category combinations
    df_cats = pd.DataFrame(list(product(c1, c2)), columns=['x', 'y'])
    
    # create desired column
    df_cats = df_cats.assign(z=(df_cats.x.astype('str') + ' - ' + df_cats.y.astype('str')).astype('category'))
    
    # join this column to the original data
    pd.merge(df, df_cats, how="left", on=["x", "y"])
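
The cross product above has len(c1) * len(c2) rows, which can get large when both columns have many categories. As a possible alternative, here is a minimal sketch that works on the integer category codes and builds labels only for the (x, y) pairs that actually occur. The function name `combine_categoricals` is hypothetical, and the sketch assumes neither column contains missing values and that the separator does not make two different pairs produce the same label.

```python
import numpy as np
import pandas as pd

def combine_categoricals(df, left, right, sep=" - "):
    """Hypothetical helper: combine two categorical columns into one categorical."""
    # Integer codes per row (assumes no NaN, which would have code -1)
    left_codes = df[left].cat.codes.to_numpy().astype(np.int64)
    right_codes = df[right].cat.codes.to_numpy().astype(np.int64)
    n_right = len(df[right].cat.categories)

    # Encode each (left, right) pair as a single integer
    pair_ids = left_codes * n_right + right_codes

    # Distinct observed pairs, plus each row's position among them
    uniq_ids, inverse = np.unique(pair_ids, return_inverse=True)

    # Build one string per observed pair, not one per row
    left_labels = df[left].cat.categories.to_numpy()[uniq_ids // n_right]
    right_labels = df[right].cat.categories.to_numpy()[uniq_ids % n_right]
    labels = [f"{a}{sep}{b}" for a, b in zip(left_labels, right_labels)]

    return pd.Categorical.from_codes(inverse, categories=labels)

df = pd.DataFrame({"x": ["A", "A", "B", "B", "B", "C", "C"],
                   "y": ["L", "L", "M", "M", "M", "N", "M"],
                  }).astype("category")
df["z"] = combine_categoricals(df, "x", "y")
print(df)
```

With this approach the number of strings created is bounded by the number of distinct pairs in the data, which can never exceed the number of rows.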