Tags: python, pandas, dataframe, casting, dtype

Changing the dtype of the categories of a categorical column in pandas


Let's say I have a boolean column stored as a category in a pandas.DataFrame. But there's a twist: the underlying category values are str, not bool. I.e., the values are "True"/"False", not True/False.

How do I:

  1. change the dtype of the underlying category values (e.g. from "True" to True) and
  2. continue storing the field as a category?

Having the boolean values as strings is an issue with DataFrame.query, for example. I have to specify DataFrame.query("field == 'True'"), which is pretty horrendous lol.

FYI - I don't want to do DataFrame.astype(dict(field=bool)), because then I lose the memory efficiency of category. I want to keep the category dtype.


Solution

  • Maybe you can try replace with a mapping from the string values to booleans:

    df['field'] = df['field'].replace({'True': True, 'False': False})
    print(df['field'])
    
    # Output
    0    False
    1     True
    2     True
    3    False
    Name: field, dtype: category
    Categories (2, object): [False, True]  # <- bool
    

    With query:

    >>> df.query('field == True')
      field
    1  True
    2  True
    

    Setup:

    import pandas as pd

    df = pd.DataFrame({'field': ['False', 'True', 'True', 'False']}, dtype='category')
    print(df['field'])
    
    # Output
    0    False
    1     True
    2     True
    3    False
    Name: field, dtype: category
    Categories (2, object): ['False', 'True']  # <- str
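
A possible alternative, as a sketch rather than a definitive recipe: the .cat accessor has rename_categories, which relabels the categories directly and leaves the integer codes alone. As far as I know, recent pandas releases warn when replace is used to change the categories of categorical data and point to rename_categories instead, so this may be the more future-proof route. The name mapping below is just illustrative, and the setup is the same as in the answer.

```python
import pandas as pd

# Same setup as in the answer: the categories are the strings "False"/"True".
df = pd.DataFrame({"field": ["False", "True", "True", "False"]}, dtype="category")

# Old category label -> new category value (illustrative name).
mapping = {"True": True, "False": False}

codes_before = df["field"].cat.codes.tolist()

# Relabel the categories; the per-row integer codes are not touched.
df["field"] = df["field"].cat.rename_categories(mapping)

codes_after = df["field"].cat.codes.tolist()

# The column should still be categorical, now with boolean categories.
is_categorical = isinstance(df["field"].dtype, pd.CategoricalDtype)
categories = list(df["field"].cat.categories)

# The query no longer needs the quoted string.
true_rows = df.query("field == True")
print(true_rows)
```

Since only the category labels change, this should be cheap even on a long column. How the categories' dtype is displayed (object vs. bool) may differ between pandas versions.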