python pandas categorical-data

How to convert categorical data to indices and print the assignment?


Context

I have a series with categorical data. My goal is to convert it to indices as in the example below. However, there are two further requirements:

  • NaN values should stay NaN and not be converted to an index such as -1
  • I would like to print the assignment of each category to its index

Code

red    -> 0
blue   -> 1
green  -> 2
nan    -> nan
red    -> 0
yellow -> 3
green  -> 2
nan    -> nan

series = series.astype('category').cat.codes

Question

How can I achieve this goal?


Solution

  • The -1 used in categorical data is there for efficiency. Either use a Categorical and leave its internals alone, or define a custom order and map your own values.

    categorical

    You can use a Categorical with an explicit category order: the codes follow the order of `categories` (the first one is 0, the second is 1, etc.), but NaN is encoded as -1:

    df['col'] = pd.Categorical(df['col'], ordered=True,
                               categories=['red', 'blue', 'green', 'yellow'])
    

    Example:

    import numpy as np
    import pandas as pd
    
    df = pd.DataFrame({'col': ['blue', 'red', 'yellow', np.nan]})
    
    df['col'] = pd.Categorical(df['col'], ordered=True,
                               categories=['red', 'blue', 'green', 'yellow'])
    
    print(df['col'].cat.codes)
    

    Output:

    0    1
    1    0
    2    3
    3   -1
    dtype: int8
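
To also print the assignment of category to index, the category list itself can be enumerated, since a code is simply the position of a value in `categories`. A minimal sketch, reusing the example values above (the names `col` and `assignment` are only illustrative):

```python
import numpy as np
import pandas as pd

col = pd.Categorical(['blue', 'red', 'yellow', np.nan], ordered=True,
                     categories=['red', 'blue', 'green', 'yellow'])

# A code is the position of the value in `categories`,
# so enumerating the categories yields the category -> index assignment.
assignment = {category: code for code, category in enumerate(col.categories)}

print(assignment)
print(col.codes)  # NaN is still encoded as -1 here
```

Note that NaN has no entry in the assignment: it is not a category, which is why it gets the -1 sentinel.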
    

    custom values

    If you really need NaN to stay NaN, then a Categorical is not appropriate; use a map instead:

    df['col'] = df['col'].map({'red': 0, 'blue': 1, 'green': 2, 'yellow': 3})
    print(df)
    

    Or, building the mapping automatically from an ordered list:

    order = ['red', 'blue', 'green', 'yellow']
    
    df['col'] = df['col'].map({k: v for v, k in enumerate(order)})
    
    print(df)
    

    Output:

       col
    0  1.0
    1  0.0
    2  3.0
    3  NaN
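
If the order of the categories is not known in advance, one possible variation (not part of the answer above) is `pd.factorize`, which numbers the values in order of first appearance and returns the unique values alongside the codes. The sketch below uses the series from the question, converts the -1 sentinel back to NaN, and prints the assignment; the names `indices` and `assignment` are illustrative.

```python
import numpy as np
import pandas as pd

series = pd.Series(['red', 'blue', 'green', np.nan,
                    'red', 'yellow', 'green', np.nan])

# factorize numbers the values in order of first appearance;
# missing values receive the sentinel code -1.
codes, uniques = pd.factorize(series)

# Replace the -1 sentinel with NaN (this turns the result into floats).
indices = pd.Series(codes, index=series.index).where(codes != -1)

# Category -> index assignment, derived from the order of `uniques`.
assignment = {category: index for index, category in enumerate(uniques)}

print(assignment)
print(indices)
```

As with the map-based version, the result has a float dtype, because an integer column cannot hold NaN.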