I have a Pandas version 0.15.2 dataframe as below with an ordinal column rate
, represented initially as strings. My end goal is to find the similarities of different rows in the df
(in reality I have a lot more rows and more ordinal variables).
Currently, to factorize() while enforcing the ordering I want, I first convert rate
to Categorical, then sort the df
and finally factorize so Bad
gets the smallest integer value and Good
gets the highest. In this fashion I create the factor_rate
column, which I can then use as a dimension (among others) to calculate similarities / distances. These are given in the code below.
However, I feel this is a memory inefficient and slow hack. What is the fastest and most memory-efficient way to achieve this?
df = pd.DataFrame({'id' : range(1,9),
'rate' : ['bad', 'bad', 'neutral', 'good',
'neutral', 'good', 'bad', 'neutral'],
'amount' : np.random.randn(8)}, columns= ['id','rate','amount'])
df.rate = pd.Categorical(df.rate, categories=["bad","neutral","good"], ordered=True)
df = df.sort('rate', ascending=True)
df.loc[:,'factor_rate'] = pd.factorize(df.rate)[0]
Note 1: This question is linked to my previous one.
Note 2: I know that pandas.factorize()
had an order
parameter which is now or soon will be deprecated, so I cannot use it directly.
Note 3: I am aware of the Gower metric and that ordinal variables should be treated differently to purely numeric ones in similarity calculations, but that is not the point of my question.
If you're concerned or reliant on categorical then another approach is to define your categories in a list and an order, use this to create a dict to map the order to the categories and pass this dict to map
:
In [560]:
df = pd.DataFrame({'id' : range(1,9),
'rate' : ['bad', 'bad', 'neutral', 'good',
'neutral', 'good', 'bad', 'neutral'],
'amount' : np.random.randn(8)}, columns= ['id','rate','amount'])
cat_list = ["bad","neutral","good"]
cat_dict = dict(zip(cat_list, np.arange(len(cat_list))))
df['factor_rate'] = df['rate'].map(cat_dict)
df
Out[560]:
id rate amount factor_rate
0 1 bad -1.436574 0
1 2 bad 0.210717 0
2 3 neutral -0.896904 1
3 4 good -1.724393 2
4 5 neutral 1.792339 1
5 6 good -1.312713 2
6 7 bad 0.555877 0
7 8 neutral -0.681881 1
sorting by the new 'factor_rate' column produces the same output as your current solution:
In [562]:
df.sort('factor_rate')
Out[562]:
id rate amount factor_rate
0 1 bad -1.436574 0
1 2 bad 0.210717 0
6 7 bad 0.555877 0
2 3 neutral -0.896904 1
4 5 neutral 1.792339 1
7 8 neutral -0.681881 1
3 4 good -1.724393 2
5 6 good -1.312713 2
EDIT
Actually the underlying coding is stored in the Category dtype:
In [569]:
df = pd.DataFrame({'id' : range(1,9),
'rate' : ['bad', 'bad', 'neutral', 'good',
'neutral', 'good', 'bad', 'neutral'],
'amount' : np.random.randn(8)}, columns= ['id','rate','amount'])
df['rate'] = pd.Categorical(df.rate, categories=["bad","neutral","good"], ordered=True)
df['rate'].cat.codes
Out[569]:
0 0
1 0
2 1
3 2
4 1
5 2
6 0
7 1
dtype: int8
So my answer may not be necessary, the link to the docs is empty at the moment.