Context
I have a series with categorical data. My goal is to convert it to indices, as in the example below. However, there are two other requirements:
Code
red -> 0
blue -> 1
green -> 2
nan -> nan
red -> 0
yellow -> 3
green -> 2
nan -> nan
series = series.astype('category').cat.codes
Question
How can I achieve this goal?
The -1 used in categorical data is there for efficiency. Either use a Categorical and don't mess with its internals, or use a custom order and map your own values.
You can use ordered categories, since the codes follow the order of the categories (the first one is 0, the second is 1, etc.), but NaN will be -1:
df['col'] = pd.Categorical(df['col'], ordered=True,
                           categories=['red', 'blue', 'green', 'yellow'])
Example:
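import numpy as np
import pandas as pd

df = pd.DataFrame({'col': ['blue', 'red', 'yellow', np.nan]})
# Explicit category order: red=0, blue=1, green=2, yellow=3
df['col'] = pd.Categorical(df['col'], ordered=True,
                           categories=['red', 'blue', 'green', 'yellow'])
print(df['col'].cat.codes)  # NaN is encoded as -1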
Output:
0    1
1    0
2    3
3   -1
dtype: int8
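To see why the explicit categories list matters, here is a minimal sketch using the values from the question (the variable names default_codes and explicit are only illustrative). Without an explicit list, pandas infers the categories and sorts them, so the codes follow alphabetical order rather than the desired one. In both cases NaN is encoded as -1.

```python
import numpy as np
import pandas as pd

# Sample data taken from the question's example.
series = pd.Series(['red', 'blue', 'green', np.nan,
                    'red', 'yellow', 'green', np.nan])

# Default: categories are inferred and sorted (blue, green, red, yellow).
default_codes = series.astype('category').cat.codes

# Explicit categories: codes follow the order given here.
explicit = pd.Categorical(series, ordered=True,
                          categories=['red', 'blue', 'green', 'yellow'])

print(default_codes.tolist())   # [2, 0, 1, -1, 2, 3, 1, -1]
print(explicit.codes.tolist())  # [0, 1, 2, -1, 0, 3, 2, -1]
```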
If you really need NaN as NaN, then a Categorical is not appropriate; instead use map:
df['col'] = df['col'].map({'red': 0, 'blue': 1, 'green': 2, 'yellow': 3})
print(df)
Or, automatically:
order = ['red', 'blue', 'green', 'yellow']
df['col'] = df['col'].map({k: v for v, k in enumerate(order)})
print(df)
Output:
   col
0  1.0
1  0.0
2  3.0
3  NaN
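If the order should simply follow first appearance, as the question's example seems to suggest (this is an assumption, not something the question states), the order list can be derived from the data itself. The sketch below also shows one possible way to avoid the float output: casting to the nullable Int64 dtype, which stores missing values as <NA> instead of NaN. The names mapping, codes and codes_int are illustrative.

```python
import numpy as np
import pandas as pd

series = pd.Series(['red', 'blue', 'green', np.nan,
                    'red', 'yellow', 'green', np.nan])

# Unique non-missing values, in order of first appearance.
order = series.dropna().unique()
mapping = {k: v for v, k in enumerate(order)}

# map leaves missing values as NaN, so the result is float64.
codes = series.map(mapping)

# Optional: nullable integer dtype keeps whole numbers, missing becomes <NA>.
codes_int = codes.astype('Int64')

print(mapping)  # {'red': 0, 'blue': 1, 'green': 2, 'yellow': 3}
print(codes_int)
```

Whether the float column or the Int64 column is preferable depends on what consumes the result downstream; some libraries expect plain NumPy floats with NaN.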