Is there a way of using df['Column'].astype('category') and df['Column'].cat.reorder_categories() to list NaN in one of the positions? .astype() doesn't appear to affect NaN values in my dataframe.
Basically for df['Column'].unique() I have:
['Moderate' 'Liberal' 'Somewhat Conservative' 'Somewhat liberal' 'Very Liberal' 'Very Conservative' 'Conservative' nan]
And I would like to get it to:
['Very Liberal' < 'Liberal' < 'Somewhat liberal' < 'Moderate' < 'Somewhat Conservative' < 'Conservative' < 'Very Conservative' < nan]
I have tried:
df['Column'] = df['Column'].astype('category')
df['Column'] = df['Column'].cat.reorder_categories(['Very Liberal', 'Liberal', 'Somewhat liberal', 'Moderate', 'Somewhat Conservative', 'Conservative', 'Very Conservative', np.nan], ordered=True)
But it throws the error "ValueError: items in new_categories are not the same as in old categories" indicating that np.nan doesn't exist in the categories.
So I guess I'm wondering how to specify/represent NaN as a category, and how to order it within categories of a column.
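reorder_categories raises this error whenever the new list is not exactly the same set as the existing categories. NaN is never one of them, and any non-NA categories missing from your column must first be added with add_categories.
You should not add NaN as a category, however: missing values are always allowed in a categorical and are stored with code -1, so NaN is not directly orderable within the categories. You can, however, choose where NaN goes when sorting, using the na_position parameter of sort_values (for example na_position='last').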
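order = ['Very Liberal', 'Liberal', 'Somewhat liberal', 'Moderate', 'Somewhat Conservative', 'Conservative', 'Very Conservative']
# assumes df['Column'] is already categorical, e.g. after .astype('category')
df['Column'] = (df['Column']
    # add any label from `order` that does not occur in the column yet
    .cat.add_categories(set(order).difference(df['Column'].cat.categories))
    # NaN is deliberately left out of `order`
    .cat.reorder_categories(order, ordered=True)
)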
Now let's sort:
df['Column'].sort_values(na_position='last')
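For a self-contained check of the behaviour described above, here is a minimal sketch. The sample Series is made up for illustration and stands in for df['Column']. It also builds the ordered categorical in one step with pd.CategoricalDtype, which is an alternative to the add_categories / reorder_categories chain, not the same code path.

```python
import numpy as np
import pandas as pd

order = ['Very Liberal', 'Liberal', 'Somewhat liberal', 'Moderate',
         'Somewhat Conservative', 'Conservative', 'Very Conservative']

# Hypothetical sample data standing in for df['Column']
s = pd.Series(['Moderate', np.nan, 'Very Liberal', 'Conservative'])

# Build the ordered categorical directly; NaN is not listed in `order`
s = s.astype(pd.CategoricalDtype(categories=order, ordered=True))

# Missing values are not a category: they are stored with code -1
codes = s.cat.codes.tolist()

# The position of NaN is chosen at sort time, not in the category order
nan_last = s.sort_values(na_position='last')
nan_first = s.sort_values(na_position='first')

print(codes)
print(nan_last)
print(nan_first)
```

Note that with this approach any value not listed in order would silently become NaN, so it is worth checking the labels for typos first.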
If you really want an orderable NaN, replace the missing values with a placeholder string such as 'NAN' and include it as a category.
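A rough sketch of the placeholder idea, again on made-up sample data. The label 'NAN' and its position at the end of the order are arbitrary choices for illustration; put it wherever you want missing values to rank.

```python
import numpy as np
import pandas as pd

order = ['Very Liberal', 'Liberal', 'Somewhat liberal', 'Moderate',
         'Somewhat Conservative', 'Conservative', 'Very Conservative']
placeholder = 'NAN'  # arbitrary label chosen for this example

# Hypothetical sample data standing in for df['Column']
s = pd.Series(['Moderate', np.nan, 'Very Liberal', 'Conservative'])

# Replace missing values first, then make the placeholder the highest category
dtype = pd.CategoricalDtype(categories=order + [placeholder], ordered=True)
s = s.fillna(placeholder).astype(dtype)

# The placeholder now takes part in sorting and comparisons like any category
sorted_values = s.sort_values().tolist()
above_conservative = (s > 'Conservative').tolist()

print(sorted_values)
print(above_conservative)
```

The trade-off is that the placeholder is an ordinary string, so isna() and dropna() no longer treat those rows as missing.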