
Cleanly create new column of categorical data


I can add a categorical column to a Pandas DataFrame like so:

import pandas as pd

label_type = pd.api.types.CategoricalDtype(categories=["positive", "negative"], ordered=False)

d = {'col1': [1, 2], 'col2': [3, 4]}
df = pd.DataFrame(data=d)

# Create a new column, setting the value universally to "positive"
df['label'] = pd.Series(["positive"] * len(df), dtype=label_type).values

This is less elegant than the shorthand that works for other types:

df['label2'] = "positive"  # sets entire column to str("positive")

but the underlying element type seems to be just a str:

print(type(df['label'].iloc[0]))
<class 'str'>

so it seems that pandas has to know the column type ahead of time.

Is there any way to add a categorical column to a DataFrame without manually constructing the Series? For example, something like this (hypothetical syntax):

df['label3'] = label_type("positive")

Solution

  • How about this:

    df['col4'] = df.assign(col4 = 'positive')['col4'].astype(label_type)
    
    df.info()
    
    <class 'pandas.core.frame.DataFrame'>
    RangeIndex: 2 entries, 0 to 1
    Data columns (total 4 columns):
     #   Column  Non-Null Count  Dtype   
    ---  ------  --------------  -----   
     0   col1    2 non-null      int64   
     1   col2    2 non-null      int64   
     2   label   2 non-null      category
     3   col4    2 non-null      category
    dtypes: category(2), int64(2)
    memory usage: 412.0 bytes
    

    Though each element is still a str:

    type(df['col4'].iloc[0])
    
    str
    

    That is expected: the category dtype is a property of the column as a whole, and iloc[] returns the category value itself, which here is a plain str.

    Or just do it in two steps:

    df['col4'] = 'positive'
    df['col4'] = df['col4'].astype(label_type)
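
Another possible one-step variant, sketched here under the assumption that building a `pd.Categorical` directly is acceptable (it still repeats the value `len(df)` times, but skips both the intermediate Series and the temporary object column). The column name `col5` is only illustrative:

```python
import pandas as pd

label_type = pd.api.types.CategoricalDtype(categories=["positive", "negative"], ordered=False)
df = pd.DataFrame({'col1': [1, 2], 'col2': [3, 4]})

# Build the categorical array directly and assign it as the new column
df['col5'] = pd.Categorical(["positive"] * len(df), dtype=label_type)

# The dtype lives on the column; each element is still a plain Python str
print(df['col5'].dtype)           # category
print(type(df['col5'].iloc[0]))   # <class 'str'>
```

Checking `df['col5'].dtype` (rather than the type of a single element) is the way to confirm that the column really is categorical.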