Tags: pandas, numpy, categorical-data

Update categories in two Series / Columns for comparison


If I try to compare two Series with different categories, I get an error:

import pandas as pd

a = pd.Categorical([1, 2, 3])
b = pd.Categorical([4, 5, 3])
# A dict makes each Categorical a column and keeps its category dtype
df = pd.DataFrame({'a': a, 'b': b})

   a  b
0  1  4
1  2  5
2  3  3

df.a == df.b

# TypeError: Categoricals can only be compared if 'categories' are the same.

What is the best way to update categories in both Series? Thank you!

My solution:

df['b'] = df.b.cat.add_categories(df.a.cat.categories.difference(df.b.cat.categories))
df['a'] = df.a.cat.add_categories(df.b.cat.categories.difference(df.a.cat.categories))
df.a == df.b

Output:

0    False
1    False
2     True
dtype: bool
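
As a rough check of why the approach above works, the sketch below rebuilds the example frame and inspects the categories after both `add_categories` calls. It assumes the columns are unordered categoricals (the default for `pd.Categorical`), in which case the two columns only need the same set of categories; the order in which they are listed may differ.

```python
import pandas as pd

df = pd.DataFrame({
    "a": pd.Categorical([1, 2, 3]),
    "b": pd.Categorical([4, 5, 3]),
})

# Add to each column the categories it is missing from the other one.
df["b"] = df["b"].cat.add_categories(
    df["a"].cat.categories.difference(df["b"].cat.categories)
)
df["a"] = df["a"].cat.add_categories(
    df["b"].cat.categories.difference(df["a"].cat.categories)
)

# Both columns now hold the same set of categories, possibly in a different order.
cats_a = set(df["a"].cat.categories)
cats_b = set(df["b"].cat.categories)
print(cats_a == cats_b)

# The element-wise comparison no longer raises.
result = df["a"] == df["b"]
print(result.tolist())
```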

Solution

  • One idea is to build the combined categories with union_categoricals and set them on both columns:

    from pandas.api.types import union_categoricals
    
    union = union_categoricals([df.a, df.b]).categories
    
    df['a'] = df.a.cat.set_categories(union)
    df['b'] = df.b.cat.set_categories(union)
    print(df.a == df.b)
    0    False
    1    False
    2     True
    dtype: bool
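
A possible variation on the same idea, not taken from the answer above: build one shared `CategoricalDtype` from the union of both columns' categories and cast both columns to it with `astype`. The name `shared` is only illustrative, and the sketch assumes every existing value appears in the union, so no value turns into NaN during the cast.

```python
import pandas as pd

df = pd.DataFrame({
    "a": pd.Categorical([1, 2, 3]),
    "b": pd.Categorical([4, 5, 3]),
})

# One dtype whose categories are the union of both columns' categories.
shared = pd.CategoricalDtype(
    df["a"].cat.categories.union(df["b"].cat.categories)
)

# Casting both columns to the same dtype gives them identical categories.
df["a"] = df["a"].astype(shared)
df["b"] = df["b"].astype(shared)

result = df["a"] == df["b"]
print(result)
```

Because both columns end up with the very same dtype object, later operations that require matching categories (comparison, concatenation) should see them as compatible without further adjustment.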