How to convert categorical data into indices when NaN values are present in Python?


I have written a function that converts categorical data into its unique indices. It works for every value except NaN, apparently because comparing against NaN with == never matches. This causes the two problems marked in the code below.


0  male
1  female
2  NaN
3  female

import pandas


def categorial(series: pandas.Series) -> pandas.Series:
    series = series.copy()

    for index, value in enumerate(series.unique()):
        # Problem 1: The output for the Value NaN is always 0.0 %, even though nan is present in the given series.
        print(index, value, round(series[series == value].count() / len(series) * 100, 2), '%')

    for index, value in enumerate(series.unique()):
        # Problem 2: Every unique Value is converted to its Index except NaN.
        series[series == value] = index

    return series.astype(pandas.Int64Dtype())


  • How can I solve the two problems seen in the code above?


  • How should missing values (NaNs) be encoded?

    In pandas, missing values are encoded as -1:

    print (pd.factorize(categorial(df['col1']))[0])
    [ 0  1 -1  1]
    print (df['col1'].astype('category').cat.codes)
    0    1
    1    0
    2   -1
    3    0
    dtype: int8
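
    Both problems come from the fact that NaN never compares equal to anything, including itself, so series == value is False everywhere for the NaN entry. A minimal sketch of one way around this is below. It assumes the sample data from the question in a column named col1; the helper name categorical_codes is made up for illustration. It uses value_counts(dropna=False) for the percentages and pd.factorize for the codes instead of comparing with ==:

```python
import numpy as np
import pandas as pd


def categorical_codes(series: pd.Series) -> pd.Series:
    """Hypothetical helper: map each unique value to an integer code, NaN to -1."""
    # Problem 1: value_counts(dropna=False) counts NaN as its own group,
    # so no == comparison against NaN is needed.
    shares = series.value_counts(dropna=False, normalize=True) * 100
    for value, share in shares.items():
        print(value, round(share, 2), '%')

    # Problem 2: factorize assigns codes in order of first appearance
    # and marks missing values with -1.
    codes, uniques = pd.factorize(series)
    return pd.Series(codes, index=series.index, name=series.name)


# Sample data assumed from the question.
df = pd.DataFrame({'col1': ['male', 'female', np.nan, 'female']})

print(np.nan == np.nan)  # False: this is why series == value misses NaN
result = categorical_codes(df['col1'])
print(result.tolist())  # [0, 1, -1, 1]
```

    Unlike .cat.codes, which numbers the categories in sorted order, factorize numbers them in order of first appearance, which matches what the original loop over series.unique() was doing.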