Combining Two Pandas Categorical Columns into One Column

I have a Pandas DataFrame that has two categorical columns:

df = pd.DataFrame({"x": ["A", "A", "B", "B", "B", "C", "C"],
                   "y": ["L", "L", "M", "M", "M", "N", "M"],
                  }).astype("category", "category")

   x  y
0  A  L
1  A  L
2  B  M
3  B  M
4  B  M
5  C  N
6  C  M

I would like to combine the two columns and store the results as a new categorical column but separated by " - ". One naive way of doing this is to convert the columns to strings:

df.assign(z=df.x.astype(str) + " - " + df.y.astype(str))

   x  y      z
0  A  L  A - L
1  A  L  A - L
2  B  M  B - M
3  B  M  B - M
4  B  M  B - M
5  C  N  C - N
6  C  M  C - M

This works for a small toy example but I need z to be of category dtype (not string). However, my x and y contains categorical strings (with 88903 and 39132 categories for x and y, respectively) that may be 50-100 characters long and around 500K rows. So, converting these columns to strings first is causing the memory to explode.

Is there a more efficient way to get a categorical output without using a ton of memory and taking too long?

Solution

You can try this:

import pandas as pd
from itertools import product

# original data
df = pd.DataFrame({"x": ["A", "A", "B", "B", "B", "C", "C"],
                   "y": ["L", "L", "M", "M", "M", "N", "M"],
                  }).astype("category", "category")

# extract unique categories
c1 = df.x.cat.categories
c2 = df.y.cat.categories

# make data frame with all possible category combinations
df_cats = pd.DataFrame(list(product(c1, c2)), columns=['x', 'y'])

# create desired column
df_cats = df_cats.assign(grp=df_cats.x.astype('str') + '-' + df_cats.y.astype('str'))

# join this column to the original data
pd.merge(df, df_cats, how="left", left_on=["x", "y"], right_on=["x", "y"])