scipy correlation pearson-correlation scipy.stats

Should I encode my ordinal variables before calculating Spearman's rank correlation (SciPy)?

I am using scipy.stats.spearmanr to calculate Spearman's rank correlation of two ordinal variables. I wasn't sure whether to encode them first. I tried it both ways, and the function returns a result either way, so I am not sure which way to go.

from scipy import stats

# dummy data comparing one ordinal variable with another
print(stats.spearmanr(['always','never','sometimes','always'], ['high','medium','low','low']))
>> SpearmanrResult(correlation=0.5000000000000001, pvalue=0.4999999999999999)

# encoding
print(stats.spearmanr([3,1,2,3], [3,2,1,1]))
>> SpearmanrResult(correlation=0.05555555555555556, pvalue=0.9444444444444444)

Solution

Unless the alphabetical order of your data is equal to the intended order, you should encode your variables.

Internally, SciPy ranks your data before computing the correlation, so the result depends on how the values sort. Integers sort by value, e.g. 1 < 2 < 3. Strings sort lexicographically, which for lowercase labels like yours means alphabetically, e.g. a < b < c.
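To make the ranking step concrete, here is a minimal sketch using the encoded integers from the question. It relies on the standard definition of Spearman's coefficient as Pearson's coefficient computed on the ranks, with tied values sharing an average rank. The variable names are illustrative only.

```python
from scipy import stats

x = [3, 1, 2, 3]
y = [3, 2, 1, 1]

# rankdata assigns ranks by sort order; tied values share the average rank by default
x_ranks = stats.rankdata(x)
y_ranks = stats.rankdata(y)

# Spearman's coefficient computed directly (index 0 is the coefficient)
rho = stats.spearmanr(x, y)[0]

# Pearson's coefficient on the ranks should give the same number
r_on_ranks = stats.pearsonr(x_ranks, y_ranks)[0]

print(x_ranks, y_ranks)
print(rho, r_on_ranks)
```

Because only the ranks enter the calculation, any encoding that preserves the intended order gives the same coefficient.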

In your case, the intended orders are probably

never < sometimes < always
low < medium < high

However, sorting these values alphabetically yields the (most likely unintended) orders

always < never < sometimes
high < low < medium
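These alphabetical orders can be checked in plain Python, since the built-in sorted also compares strings lexicographically. This only illustrates the sort order of the labels; it is not a view into SciPy's internals.

```python
frequency = ['always', 'never', 'sometimes', 'always']
level = ['high', 'medium', 'low', 'low']

# Unique labels in the order a lexicographic sort puts them
alphabetical_frequency = sorted(set(frequency))
alphabetical_level = sorted(set(level))

print(' < '.join(alphabetical_frequency))  # always < never < sometimes
print(' < '.join(alphabetical_level))      # high < low < medium
```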

If you manually encode these values as integers, or as strings that sort in the intended order, the problem goes away:

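import scipy.stats  # import the submodule explicitly; a bare "import scipy" may not load scipy.stats on older versions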

# Incorrect alphabetical order
scipy.stats.spearmanr(['always','never','sometimes','always'], ['high','medium','low','low'])
# SpearmanrResult(correlation=0.5000000000000001, pvalue=0.4999999999999999)

# Incorrect integer order (codes assigned alphabetically, so the result matches the string version)
scipy.stats.spearmanr([1,2,3,1], [1,3,2,2])
# SpearmanrResult(correlation=0.5000000000000001, pvalue=0.4999999999999999)

# Correct integer order
scipy.stats.spearmanr([3,1,2,3], [3,2,1,1])
# SpearmanrResult(correlation=0.05555555555555556, pvalue=0.9444444444444444)

# Correct alphabetical order (letters chosen so that a < b < c matches the intended order)
scipy.stats.spearmanr(['c','a','b','c'], ['c','b','a','a'])
# SpearmanrResult(correlation=0.05555555555555556, pvalue=0.9444444444444444)
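Rather than typing the encoded lists by hand, the encoding can be done with an explicit mapping. The sketch below assumes the intended orders are never < sometimes < always and low < medium < high; the dictionary and variable names are hypothetical and not part of SciPy.

```python
from scipy import stats

# Assumed intended orders; adjust these mappings to match your own data
FREQUENCY_ORDER = {'never': 1, 'sometimes': 2, 'always': 3}
LEVEL_ORDER = {'low': 1, 'medium': 2, 'high': 3}

frequency = ['always', 'never', 'sometimes', 'always']
level = ['high', 'medium', 'low', 'low']

# Replace each label with its integer position in the intended order
frequency_codes = [FREQUENCY_ORDER[value] for value in frequency]
level_codes = [LEVEL_ORDER[value] for value in level]

# The result unpacks into the coefficient and the p-value
rho, pvalue = stats.spearmanr(frequency_codes, level_codes)
print(frequency_codes, level_codes)
print(rho, pvalue)
```

Writing the order down as a dictionary keeps the assumption visible, and a misspelled label raises a KeyError instead of being silently ranked.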