Search code examples
pythonpandasdistancesimilaritycategorical-data

Pandas Ordinal Variable Treatment in Similarity Calculation


I have a Pandas version 0.15.2 dataframe as below with an ordinal column rate, represented initially as strings. My end goal is to find the similarities of different rows in the df (in reality I have a lot more rows and more ordinal variables).

Currently, to factorize() while enforcing the ordering I want, I first convert rate to Categorical, then sort the df and finally factorize so Bad gets the smallest integer value and Good gets the highest. In this fashion I create the factor_rate column, which I can then use as a dimension (among others) to calculate similarities / distances. These are given in the code below.

However, I feel this is a memory inefficient and slow hack. What is the fastest and most memory-efficient way to achieve this?

df = pd.DataFrame({'id' : range(1,9),
                   'rate' : ['bad', 'bad', 'neutral', 'good',
                                'neutral', 'good', 'bad', 'neutral'],
                   'amount' : np.random.randn(8)},  columns= ['id','rate','amount'])

df.rate = pd.Categorical(df.rate, categories=["bad","neutral","good"], ordered=True)

df = df.sort('rate', ascending=True)
df.loc[:,'factor_rate'] = pd.factorize(df.rate)[0]

Note 1: This question is linked to my previous one.

Note 2: I know that pandas.factorize() had an order parameter which is now or soon will be deprecated, so I cannot use it directly.

Note 3: I am aware of the Gower metric and that ordinal variables should be treated differently to purely numeric ones in similarity calculations, but that is not the point of my question.


Solution

  • If you're concerned or reliant on categorical then another approach is to define your categories in a list and an order, use this to create a dict to map the order to the categories and pass this dict to map:

    In [560]:
    
    df = pd.DataFrame({'id' : range(1,9),
                       'rate' : ['bad', 'bad', 'neutral', 'good',
                                    'neutral', 'good', 'bad', 'neutral'],
                       'amount' : np.random.randn(8)},  columns= ['id','rate','amount'])
    cat_list = ["bad","neutral","good"]
    cat_dict = dict(zip(cat_list, np.arange(len(cat_list))))
    
    df['factor_rate'] = df['rate'].map(cat_dict)
    df
    Out[560]:
       id     rate    amount  factor_rate
    0   1      bad -1.436574            0
    1   2      bad  0.210717            0
    2   3  neutral -0.896904            1
    3   4     good -1.724393            2
    4   5  neutral  1.792339            1
    5   6     good -1.312713            2
    6   7      bad  0.555877            0
    7   8  neutral -0.681881            1
    

    sorting by the new 'factor_rate' column produces the same output as your current solution:

    In [562]:
    
    df.sort('factor_rate')
    Out[562]:
       id     rate    amount  factor_rate
    0   1      bad -1.436574            0
    1   2      bad  0.210717            0
    6   7      bad  0.555877            0
    2   3  neutral -0.896904            1
    4   5  neutral  1.792339            1
    7   8  neutral -0.681881            1
    3   4     good -1.724393            2
    5   6     good -1.312713            2
    

    EDIT

    Actually the underlying coding is stored in the Category dtype:

    In [569]:
    
    df = pd.DataFrame({'id' : range(1,9),
                       'rate' : ['bad', 'bad', 'neutral', 'good',
                                    'neutral', 'good', 'bad', 'neutral'],
                       'amount' : np.random.randn(8)},  columns= ['id','rate','amount'])
    ​
    df['rate'] = pd.Categorical(df.rate, categories=["bad","neutral","good"], ordered=True)
    df['rate'].cat.codes
    Out[569]:
    0    0
    1    0
    2    1
    3    2
    4    1
    5    2
    6    0
    7    1
    dtype: int8
    

    So my answer may not be necessary, the link to the docs is empty at the moment.