I have a Pandas DataFrame that has two categorical columns:
df = pd.DataFrame({"x": ["A", "A", "B", "B", "B", "C", "C"],
"y": ["L", "L", "M", "M", "M", "N", "M"],
}).astype("category", "category")
x y
0 A L
1 A L
2 B M
3 B M
4 B M
5 C N
6 C M
I would like to combine the two columns and store the results as a new categorical column but separated by " - ". One naive way of doing this is to convert the columns to strings:
df.assign(z=df.x.astype(str) + " - " + df.y.astype(str))
x y z
0 A L A - L
1 A L A - L
2 B M B - M
3 B M B - M
4 B M B - M
5 C N C - N
6 C M C - M
This works for a small toy example but I need z
to be of category
dtype (not string). However, my x
and y
contains categorical strings (with 88903 and 39132 categories for x
and y
, respectively) that may be 50-100 characters long and around 500K rows. So, converting these columns to strings first is causing the memory to explode.
Is there a more efficient way to get a categorical output without using a ton of memory and taking too long?
You can try this:
import pandas as pd
from itertools import product
# original data
df = pd.DataFrame({"x": ["A", "A", "B", "B", "B", "C", "C"],
"y": ["L", "L", "M", "M", "M", "N", "M"],
}).astype("category", "category")
# extract unique categories
c1 = df.x.cat.categories
c2 = df.y.cat.categories
# make data frame with all possible category combinations
df_cats = pd.DataFrame(list(product(c1, c2)), columns=['x', 'y'])
# create desired column
df_cats = df_cats.assign(grp=df_cats.x.astype('str') + '-' + df_cats.y.astype('str'))
# join this column to the original data
pd.merge(df, df_cats, how="left", left_on=["x", "y"], right_on=["x", "y"])