Tags: python, pandas, categorical-data

Redefine categories of a categorical variable ignoring upper and lower case


I have a dataset with a categorical variable that is not cleanly coded. The same category appears sometimes in upper case, sometimes in lower case, and in several mixed-case variations. Since the dataset is large, I would like to harmonize the categories while taking advantage of the categorical dtype, which rules out any replace-based solution. The only solutions I found are this and this, but I feel they implicitly make use of replace.

Below is a toy example, together with the solutions I tried:

from pandas import Series

# Create dataset
df = Series(["male", "female", "Male", "FEMALE", "MALE", "MAle"], dtype="category", name="NEW_TEST")

# Define the old, the "new" and the desired categories
original_categories = list(df.cat.categories)
standardised_categories = list(map(lambda x: x.lower(), df.cat.categories)) 
desired_new_cat = list(set(standardised_categories))

# Failed attempts to change the categories: the lower-cased names contain duplicates
df.cat.categories = standardised_categories
df = df.cat.rename_categories(standardised_categories)
# Error message: Categorical categories must be unique

Solution

  • You shouldn't try to harmonize after converting to category. Converting first defeats the purpose of a categorical, because one category is created per exact string (here "male", "Male", "MALE" and "MAle" each become a separate category).

    You can instead harmonize the case with str.capitalize, then convert to categorical:

    import pandas as pd

    s = (pd.Series(["male", "female", "Male", "FEMALE", "MALE", "MAle"],
                   name="NEW_TEST")
           .str.capitalize().astype('category')
         )
    

    If you already have a categorical Series, convert it back to string and start over:

    s = s.astype(str).str.capitalize().astype('category')
    

    Output:

    0      Male
    1    Female
    2      Male
    3    Female
    4      Male
    5      Male
    Name: NEW_TEST, dtype: category
    Categories (2, object): ['Female', 'Male']
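
If the Series is already categorical, a possible alternative to the round trip through astype(str) is Series.map. To my understanding, map on a categorical applies the function to the categories rather than to every row, and because the mapping here is many-to-one the result is no longer categorical, so it is cast back explicitly. This is only a minimal sketch on the toy data from the question; the names n_before, n_after and harmonized are illustrative and not part of the original answer.

```python
import pandas as pd

# Toy data from the question, already stored as a categorical
s = pd.Series(["male", "female", "Male", "FEMALE", "MALE", "MAle"],
              dtype="category", name="NEW_TEST")

# Converting first created one category per exact spelling
n_before = len(s.cat.categories)

# Map each category to its capitalized form, then cast back to category
# (the many-to-one mapping does not return a categorical by itself)
harmonized = s.map(str.capitalize).astype("category")

n_after = len(harmonized.cat.categories)
print(n_before, n_after)
print(list(harmonized.cat.categories))
```

Comparing the number of categories before and after is a quick way to check that the spellings were actually merged.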