Pandas `apply` with a dictionary lookup raises KeyError on a categorical column after taking a subset of rows

I have a Pandas dataframe, and I obtain a subset of its rows. Then, I use the apply method to transform a column of type category using a dictionary to convert it to integer values. However, it seems that the transformation is being applied to all data in the category list, even those that are not included in the subset of the dataframe. Everything works perfectly when the column type is not set as a category. The provided MWE demonstrates this issue:

import pandas as pd

df = pd.DataFrame({'A': [11, 22, 33], 'B': ['foo', 'baz', 'bar']})
df['B'] = df['B'].astype('category')
dfs = df[df['A'] > 20].reset_index(drop=True)  # the 'foo' row is filtered out
d = {'baz': 14, 'bar': 19}
dfs['B'].apply(lambda x: d[x])  # raises KeyError: 'foo'

Although 'foo' is not included in dfs, I receive a KeyError: 'foo' error! However, when I don't set the column as a category, the code works:

df = pd.DataFrame({'A':[11,22,33], 'B':['foo','baz', 'bar']})
dfs = df[df['A'] > 20].reset_index(drop=True)
d = {'baz': 14, 'bar': 19}
dfs['B'].apply(lambda x: d[x])

In this case, I get the following output:

0    14
1    19
Name: B, dtype: int64

I don't understand why, in the case of a category, the apply method functions on all data of the category and not only on the data included in the subset.

Solution

Pandas uses the category dtype largely to save memory. Internally, a categorical column consists of two arrays: a categories array that holds each distinct value once, and an integer array of codes in which each entry points to the actual value in the categories array (https://pandas.pydata.org/docs/user_guide/categorical.html#categorical-data).
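
To make the two arrays concrete, here is a small sketch that inspects them through the .cat accessor, reusing the data from the question. The variable names are only for illustration.

```python
import pandas as pd

df = pd.DataFrame({'A': [11, 22, 33], 'B': ['foo', 'baz', 'bar']})
df['B'] = df['B'].astype('category')

# Each distinct value is stored once (sorted, for strings) ...
categories = list(df['B'].cat.categories)
# ... and every row holds an integer code that indexes into that array.
codes = df['B'].cat.codes.tolist()

# The subset keeps the full categories array; only the codes shrink.
dfs = df[df['A'] > 20].reset_index(drop=True)
subset_categories = list(dfs['B'].cat.categories)
subset_codes = dfs['B'].cat.codes.tolist()

print(categories, codes)
print(subset_categories, subset_codes)
```

Note that 'foo' is still listed among the categories of dfs['B'] even though no code in the subset points to it.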

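The idea is that when a large dataset draws on a fixed set of values, you define the categories once and the column keeps behaving consistently as the data is modified; for example, ordering rules are preserved when values are removed and later added back. The categories of dfs['B'] were set from df['B'], where the available values were 'foo', 'bar', and 'baz', so all three remain categories and do not change when rows are filtered out. The KeyError in the question suggests that apply on a categorical column calls the function on the categories themselves, so d['foo'] is looked up even though no row in dfs contains 'foo'. Conversely, if new data (a new value for the 'B' column) that wasn't in the original categories is added, for instance by concatenation, the column turns back into objects; assigning such a value directly into the column raises a TypeError instead.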
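
As a rough illustration of the last point, the sketch below concatenates the subset with a value that is not among the categories. The value 'qux' and the variable names are made up for the example, and the exact resulting dtype may depend on your pandas version, so the code only checks that the result is no longer categorical.

```python
import pandas as pd

df = pd.DataFrame({'A': [11, 22, 33], 'B': ['foo', 'baz', 'bar']})
df['B'] = df['B'].astype('category')
dfs = df[df['A'] > 20].reset_index(drop=True)

# 'qux' is a hypothetical new value that is not one of the categories.
extra = pd.Series(['qux'], name='B')
combined = pd.concat([dfs['B'], extra], ignore_index=True)

# The combined column falls back to a non-categorical dtype.
still_categorical = isinstance(combined.dtype, pd.CategoricalDtype)
print(combined.dtype, still_categorical)
```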

If the data is modified so that fewer categories are in use and you want the category set to match the current data, call remove_unused_categories (available through the .cat accessor), which drops the categories that no longer appear:

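import pandas as pd

df = pd.DataFrame({'A': [11, 22, 33], 'B': ['foo', 'baz', 'bar']})
df['B'] = df['B'].astype('category')
dfs = df[df['A'] > 20].reset_index(drop=True)
d = {'baz': 14, 'bar': 19}

# Drop 'foo' from the categories, since no row in dfs uses it any more
dfs['B'] = dfs['B'].cat.remove_unused_categories()

# The lookup now only sees 'bar' and 'baz', so no KeyError is raised
dfs['B'].apply(lambda x: d[x])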
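
If you would rather leave the categories untouched, one possible alternative (a sketch, not part of the approach above) is to convert the column to object dtype before calling apply, so that the lookup runs once per row rather than over the categories. The name result is only for illustration.

```python
import pandas as pd

df = pd.DataFrame({'A': [11, 22, 33], 'B': ['foo', 'baz', 'bar']})
df['B'] = df['B'].astype('category')
dfs = df[df['A'] > 20].reset_index(drop=True)
d = {'baz': 14, 'bar': 19}

# Converting to object makes apply operate on the actual row values,
# so the unused category 'foo' is never passed to the lambda.
result = dfs['B'].astype(object).apply(lambda x: d[x])
print(result.tolist())
```

The trade-off is that the result no longer benefits from the categorical representation, which may matter for large columns.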

This behavior is described in the pandas docs at https://pandas.pydata.org/docs/user_guide/categorical.html#getting, which state:

If the slicing operation returns either a DataFrame or a column of type Series, the category dtype is preserved.

Since the line df[df['A'] > 20] is just slicing and reset_index does not affect dtypes, the category dtype remains.
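
A quick way to check this yourself is to compare the dtypes before and after taking the subset, and then look at the categories before and after remove_unused_categories. This is a minimal sketch using the question's data; the variable names are illustrative.

```python
import pandas as pd

df = pd.DataFrame({'A': [11, 22, 33], 'B': ['foo', 'baz', 'bar']})
df['B'] = df['B'].astype('category')
dfs = df[df['A'] > 20].reset_index(drop=True)

# Boolean indexing plus reset_index leaves the categorical dtype as it was.
same_dtype = dfs['B'].dtype == df['B'].dtype

# Categories before and after dropping the unused ones.
before = list(dfs['B'].cat.categories)
after = list(dfs['B'].cat.remove_unused_categories().cat.categories)

print(same_dtype, before, after)
```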