I am trying to build a simple classification model for my data stored in the pandas dataframe train
. To make this model more efficient, I created a list of column names of columns I know to store categorical data, called category_cols
. I categorize these columns as follows:
# Define the lambda function: categorize_label
categorize_label = lambda x: x.astype('category')
# Convert train[category_cols] to a categorical type
train[category_cols] = train[category_cols].apply(categorize_label, axis=0)
My target variable, material
, is categorical, and has 64 unique labels it can be assigned to. However, some of these labels only appear once in train
, which is too few to train the model well. So I'd like to filter any observations in train
that have these rare material labels. This answer provided a useful groupby+filter combination:
print('Num rows: {}'.format(train.shape[0]))
print('Material labels: {}'.format(len(train['material'].unique())))
min_count = 5
filtered = train.groupby('material').filter(lambda x: len(x) > min_count)
print('Num rows: {}'.format(filtered.shape[0]))
print('Material labels: {}'.format(len(filtered['material'].unique())))
----------------------
Num rows: 19999
Material labels: 64
Num rows: 19963
Material labels: 45
This works great in that it does filter the observations with rare material labels. However, something under the hood in the category
type seems to maintain all the previous values for material
even after they've been filtered. This becomes a problem when trying to create dummy variables, and happens even if I try to rerun my same categorize method:
filtered[category_cols] = filtered[category_cols].apply(categorize_label, axis=0)
print(pd.get_dummies(train['material']).shape)
print(pd.get_dummies(filtered['material']).shape)
----------------------
(19999, 64)
(19963, 64)
I would have expected the shape of the filtered dummies to be (19963, 45). However, pd.get_dummies
includes columns for labels that have no appearances in filtered
. I assume this has something to do with how the category
type works. If so, could someone please explain how to re-categorize a column? Or if that is not possible, how to get rid of the unnecessary columns in the filtered dummies?
Thank you!
You can use category.cat.remove_unused_categories
:
df['category'].cat.remove_unused_categories(inplace=True)
df = pd.DataFrame({'label': list('aabbccd'),
'value': [1] * 7})
print(df)
label value
0 a 1
1 a 1
2 b 1
3 b 1
4 c 1
5 c 1
6 d 1
Lets set label
as type category
df['label'] = df.label.astype('category')
print(df.label)
0 a
1 a
2 b
3 b
4 c
5 c
6 d
Name: label, dtype: category
Categories (4, object): [a, b, c, d]
Filter DataFrame
to remove label
d
df = df[df.label.ne('d')]
print(df)
label value
0 a 1
1 a 1
2 b 1
3 b 1
4 c 1
5 c 1
Remove unused_categories
df.label.cat.remove_unused_categories(inplace=True)
print(df.label)
0 a
1 a
2 b
3 b
4 c
5 c
Name: label, dtype: category
Categories (3, object): [a, b, c]