Search code examples
pythonpandascategorical-datadtype

Pandas series - how to validate each item is categorical


I am importing data which should be categorical from an externally sourced csv file into a pandas dataframe.

The first thing I want to do is to validate that the values are valid for the categorical type.

My strategy is to create an instance of CategoricalDtype and then using apply to test each value.

Question: The only way I can figure out is to test each value is in CategoricalDtype.categories.values but is there a "better" way? are there any methods I can use to achieve the same? I'm new to CategoricalDtype and it doesnt feel like this is the best way to be testing the data value.

# example of what I'm doing
import pandas as pd
from pandas.api.types import CategoricalDtype


df = pd.read_csv('data.csv')
cat = CategoricalDtype(categories=["A", "B", "C"], ordered=False)

df['data_is_valid']=df['data_field'].apply(lambda x: x in cat.categories.values)

Solution

  • If need test if exist values from column data_field :

    df['data_is_valid']=df['data_field'].isin(cat.categories)
    

    If need test also categorical_dtype:

    from pandas.api.types import is_categorical_dtype
    
    df['data_is_valid']=df['data_field'].isin(cat.categories) & is_categorical_dtype(df['data_field'])
    

    Difference is possible see in data sample:

    from pandas.api.types import CategoricalDtype
    from pandas.api.types import is_categorical_dtype
                
    df = pd.DataFrame({ "data_field": ["A", "B", "C", "D", 'E']})
    
    cat = CategoricalDtype(categories=["A", "B", "C"], ordered=False)
    
    #categories match but not Categorical
    df['data_is_valid1']=df['data_field'].isin(cat.categories) & is_categorical_dtype(df['data_field'])
    #categories match not tested Categorical
    df['data_is_valid2']=df['data_field'].isin(cat.categories)
    
    cat_type = CategoricalDtype(categories=["A", "B", "C", 'D', 'E'], ordered=True)
    
    #created Categorical column
    df['data_field'] = df['data_field'].astype(cat_type)
    
    #categoriesand Categorical match
    df['data_is_valid3']=df['data_field'].isin(cat.categories) & is_categorical_dtype(df['data_field'])
    #categories match not tested Categorical
    df['data_is_valid4']=df['data_field'].isin(cat.categories)
    
    print (df)
      data_field  data_is_valid1  data_is_valid2  data_is_valid3  data_is_valid4
    0          A           False            True            True            True
    1          B           False            True            True            True
    2          C           False            True            True            True
    3          D           False           False           False           False
    4          E           False           False           False           False