Search code examples
pythonpandascsvidioms

Is there a Pandas idiom for reading a CSV file with categorical data that has spelling variants?


I have a CSV file with multiple categorical columns, but most of these columns contain messy data due to typing mistakes (e.g., 'spciulated', 'SPICULATED', etc. for the category 'spiculated' of the column 'margins'). Is there a standard way to deal with such situations?

To be precise, I would like to read the CSV file directly into a clean DataFrame with a dtype category for the categorical columns, but with all variants collapsed into one category (e.g., each variant of 'spiculated' would be read as 'spiculated'). The spelling variants could be given by a dict, for instance.

Expected solution:

import pandas as pd

FEAT_VALS = {
    "margins": {
        "spiculated": ["spiculated", "spiiculated", "SPICULATED"],
        "circumscribed": ["circumscribed", "cicumscribed"],
    }
}

# somehow give FEAT_VALS to read_csv
df = pd.read_csv('test.csv', dtype='category')
df.margins

where test.csv is:

margins
spiculated
spiiculated
SPICULATED
circumscribed
cicumscribed

to obtain:

0       spiculated
1       spiculated
2       spiculated
3    circumscribed
4    circumscribed
Name: margins, dtype: category
Categories (2, object): ['circumscribed', 'spiculated']

However, without the spelling variants information, I get:

0       spiculated
1      spiiculated
2       SPICULATED
3    circumscribed
4     cicumscribed
Name: margins, dtype: category
Categories (5, object): ['SPICULATED', 'cicumscribed', 'circumscribed', 'spiculated', 'spiiculated']

My current solution: Looks like this

df2 = pd.read_csv('test.csv')

for feat, feat_vals in FEAT_VALS.items():
    for enc_val, str_vals in feat_vals.items():
        df2.loc[df2[feat].isin(str_vals), feat] = enc_val

df2.margins = df2.margins.astype('category')

Solution

  • You could flip the inner dicts then use .map(). This isn't as direct as you want, but at least the code using Pandas is cleaner.

    for feat, feat_vals in FEAT_VALS.items():
        feat_strs = {
            str_val: enc_val
            for enc_val, str_vals in feat_vals.items()
            for str_val in str_vals
        }
        df[feat] = df[feat].map(feat_strs).astype('category')
    
        # For demo
        print(df[feat].cat.categories)
    

    Output:

    Index(['circumscribed', 'spiculated'], dtype='object')