I understand how to use factorize to encode levels of a factor, such as "L" and "W" (for losses and wins), into numeric values, such as "0" and "1":
import pandas as pd
first_df = pd.DataFrame({'outcome': ["L", "L", "W", "W"]})
pd.factorize(first_df['outcome'])
The above returns (array([0, 0, 1, 1]), array(['L', 'W'], dtype=object)).
However, later on, I'd like to combine this result with some other results, where we now have a new outcome, a draw ("D"), and here is where things get sticky:
second_df = pd.DataFrame({'outcome': ["L", "L", "D", "D"]})
pd.factorize(second_df['outcome'])
This returns (array([0, 0, 1, 1]), array(['L', 'D'], dtype=object))
I need some way to preemptively declare the fact that there are 3 different levels when I create the dataframes, and map the correct numeric value to the correct level. How can I achieve this?
Something like this is definitely possible using a Categorical:
outcome_cat = pd.Categorical(
    first_df['outcome'],
    categories=['L', 'W', 'D'], ordered=False
)
The semantics of a Categorical may not be exactly the same as the output of pd.factorize(), but its codes attribute contains your data as numeric values. The difference is that the Categorical is also aware of the unobserved 'D' value:
outcome_cat.codes
Out[6]: array([0, 0, 1, 1], dtype=int8)
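To reuse the same mapping across both dataframes, one option is to declare the levels once and apply them everywhere. The following is a minimal sketch, assuming the full set of levels is known up front; the names levels, outcome_dtype, first_codes, second_codes and combined are made up for illustration:

```python
import pandas as pd

# Declare all three levels once; the order here fixes the codes (L=0, W=1, D=2).
levels = ['L', 'W', 'D']
outcome_dtype = pd.CategoricalDtype(categories=levels, ordered=False)

first_df = pd.DataFrame({'outcome': ["L", "L", "W", "W"]})
second_df = pd.DataFrame({'outcome': ["L", "L", "D", "D"]})

# Casting to the shared dtype gives consistent codes in both frames.
first_codes = first_df['outcome'].astype(outcome_dtype).cat.codes
second_codes = second_df['outcome'].astype(outcome_dtype).cat.codes

# The same dtype can be applied after combining the results.
combined = pd.concat([first_df, second_df], ignore_index=True)
combined['outcome'] = combined['outcome'].astype(outcome_dtype)
combined['code'] = combined['outcome'].cat.codes

print(first_codes.tolist())   # [0, 0, 1, 1]
print(second_codes.tolist())  # [0, 0, 2, 2]
```

With this approach 'D' maps to 2 in second_df instead of 1, because the code comes from the position in the declared categories rather than from the order of appearance. Note that any value not listed in the categories becomes NaN, with a code of -1.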