I am using scipy.stats.spearmanr to calculate Spearman's rank correlation of two ordinal variables. I wasn't sure whether to encode them first. I tried it both ways, and the function returns a result either way, but the results differ, so I am not sure which way to go.
from scipy import stats
# dummy data comparing one ordinal variable with another
print(stats.spearmanr(['always','never','sometimes','always'], ['high','medium','low','low']))
>> SpearmanrResult(correlation=0.5000000000000001, pvalue=0.4999999999999999)
# encoding
print(stats.spearmanr([3,1,2,3], [3,2,1,1]))
>> SpearmanrResult(correlation=0.05555555555555556, pvalue=0.9444444444444444)
Unless the alphabetical order of your data is equal to the intended order, you should encode your variables.
Internally, SciPy ranks your data in order to compute the statistic. For integers, the ranking follows the numeric values, e.g. 1 < 2 < 3. For strings, it most likely follows alphabetical order, e.g. a < b < c.
In your case, the intended orders are probably:
never < sometimes < always
low < medium < high
However, sorting these lists of values alphabetically yields the (most likely incorrect) orders:
always < never < sometimes
high < low < medium
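The alphabetical orders above can be checked with Python's built-in sorted, which compares strings lexicographically. This only illustrates string ordering; it is not a look into SciPy's internals:

```python
# Distinct category labels from the question
frequency = ['never', 'sometimes', 'always']
level = ['low', 'medium', 'high']

# sorted() compares strings lexicographically, i.e. alphabetically here
frequency_sorted = sorted(frequency)
level_sorted = sorted(level)

print(frequency_sorted)  # ['always', 'never', 'sometimes']
print(level_sorted)      # ['high', 'low', 'medium']
```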
If you manually encode these lists as integers or as correctly sortable strings, you correct this problem:
import scipy.stats  # on older SciPy versions, a bare "import scipy" does not load the stats submodule
# Incorrect alphabetical order
scipy.stats.spearmanr(['always','never','sometimes','always'], ['high','medium','low','low'])
# SpearmanrResult(correlation=0.5000000000000001, pvalue=0.4999999999999999)
# Incorrect integer order
scipy.stats.spearmanr([1,2,3,1], [1,3,2,2])
# SpearmanrResult(correlation=0.5000000000000001, pvalue=0.4999999999999999)
# Correct integer order
scipy.stats.spearmanr([3,1,2,3], [3,2,1,1])
# SpearmanrResult(correlation=0.05555555555555556, pvalue=0.9444444444444444)
# Correct alphabetical order
scipy.stats.spearmanr(['c','a','b','c'], ['c','b','a','a'])
# SpearmanrResult(correlation=0.05555555555555556, pvalue=0.9444444444444444)
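If you would rather not recode the lists by hand, one option is an explicit mapping from each label to its position in the intended order. A minimal sketch, assuming the orders never < sometimes < always and low < medium < high; the dictionary names FREQUENCY_ORDER and LEVEL_ORDER are made up for this example:

```python
from scipy import stats

# Hypothetical mappings; adjust them to the order you actually intend
FREQUENCY_ORDER = {'never': 1, 'sometimes': 2, 'always': 3}
LEVEL_ORDER = {'low': 1, 'medium': 2, 'high': 3}

x = ['always', 'never', 'sometimes', 'always']
y = ['high', 'medium', 'low', 'low']

# Translate each label to its integer position before calling spearmanr
x_encoded = [FREQUENCY_ORDER[v] for v in x]  # [3, 1, 2, 3]
y_encoded = [LEVEL_ORDER[v] for v in y]      # [3, 2, 1, 1]

# The result unpacks into the correlation coefficient and the p-value
rho, pvalue = stats.spearmanr(x_encoded, y_encoded)
print(rho)  # approximately 0.0556, the same as the "correct integer order" call
```

A label that is missing from the mapping raises a KeyError, which is probably what you want here: a typo or an unexpected category gets flagged instead of being ranked silently.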