Search code examples
pythonpandasdataframecategoriesdummy-variable

Re-categorize a column in a pandas dataframe


I am trying to build a simple classification model for my data stored in the pandas dataframe train. To make this model more efficient, I created a list of column names of columns I know to store categorical data, called category_cols. I categorize these columns as follows:

# Define the lambda function: categorize_label
categorize_label = lambda x: x.astype('category')

# Convert train[category_cols] to a categorical type
train[category_cols] = train[category_cols].apply(categorize_label, axis=0)

My target variable, material, is categorical, and has 64 unique labels it can be assigned to. However, some of these labels only appear once in train, which is too few to train the model well. So I'd like to filter any observations in train that have these rare material labels. This answer provided a useful groupby+filter combination:

print('Num rows: {}'.format(train.shape[0]))
print('Material labels: {}'.format(len(train['material'].unique())))

min_count = 5
filtered = train.groupby('material').filter(lambda x: len(x) > min_count)
print('Num rows: {}'.format(filtered.shape[0]))
print('Material labels: {}'.format(len(filtered['material'].unique())))
----------------------
Num rows: 19999
Material labels: 64
Num rows: 19963
Material labels: 45

This works great in that it does filter the observations with rare material labels. However, something under the hood in the category type seems to maintain all the previous values for material even after they've been filtered. This becomes a problem when trying to create dummy variables, and happens even if I try to rerun my same categorize method:

filtered[category_cols] = filtered[category_cols].apply(categorize_label, axis=0)
print(pd.get_dummies(train['material']).shape)
print(pd.get_dummies(filtered['material']).shape)
----------------------
(19999, 64)
(19963, 64)

I would have expected the shape of the filtered dummies to be (19963, 45). However, pd.get_dummies includes columns for labels that have no appearances in filtered. I assume this has something to do with how the category type works. If so, could someone please explain how to re-categorize a column? Or if that is not possible, how to get rid of the unnecessary columns in the filtered dummies?

Thank you!


Solution

  • You can use category.cat.remove_unused_categories:

    Usage

    df['category'].cat.remove_unused_categories(inplace=True)
    

    Example

    df = pd.DataFrame({'label': list('aabbccd'),
                       'value': [1] * 7})
    print(df)
    
      label  value
    0     a      1
    1     a      1
    2     b      1
    3     b      1
    4     c      1
    5     c      1
    6     d      1
    

    Lets set label as type category

    df['label'] = df.label.astype('category')
    print(df.label)
    
    0    a
    1    a
    2    b
    3    b
    4    c
    5    c
    6    d
    Name: label, dtype: category
    Categories (4, object): [a, b, c, d]
    

    Filter DataFrame to remove label d

    df = df[df.label.ne('d')]
    print(df)
    
      label  value
    0     a      1
    1     a      1
    2     b      1
    3     b      1
    4     c      1
    5     c      1
    

    Remove unused_categories

    df.label.cat.remove_unused_categories(inplace=True)
    print(df.label)
    
    0    a
    1    a
    2    b
    3    b
    4    c
    5    c
    Name: label, dtype: category
    Categories (3, object): [a, b, c]