
Should I encode my ordinal variables before calculating Spearman's rank correlation (SciPy)?


I am using scipy.stats.spearmanr to calculate Spearman's rank correlation between two ordinal variables. I wasn't sure whether to encode them first. I tried it both ways and the function returns a result either way, but the two results differ, so I am not sure which one to trust.

from scipy import stats

# dummy data comparing one ordinal variable with another (raw strings)
print(stats.spearmanr(['always','never','sometimes','always'], ['high','medium','low','low']))
# SpearmanrResult(correlation=0.5000000000000001, pvalue=0.4999999999999999)

# same data encoded as integers
print(stats.spearmanr([3,1,2,3], [3,2,1,1]))
# SpearmanrResult(correlation=0.05555555555555556, pvalue=0.9444444444444444)

Solution

  • Unless the alphabetical order of your data matches the intended order, you should encode your variables.

    Internally, SciPy ranks your data to compute the correlation. For integers, the ranking follows the numeric values, e.g. 1 < 2 < 3. For strings, it most likely follows their alphabetical order, e.g. a < b < c.

    In your case, the intended orders are probably

    never < sometimes < always
    low < medium < high
    

    However, sorting these values alphabetically yields the (most likely unintended) orders

    always < never < sometimes
    high < low < medium
    

    Manually encoding the values as integers, or as strings that sort in the intended order, fixes the problem:

    import scipy.stats  # import the submodule explicitly so scipy.stats is available
    
    # Raw strings, ranked alphabetically (incorrect)
    scipy.stats.spearmanr(['always','never','sometimes','always'], ['high','medium','low','low'])
    # SpearmanrResult(correlation=0.5000000000000001, pvalue=0.4999999999999999)
    
    # Integer codes that mirror the alphabetical order (still incorrect)
    scipy.stats.spearmanr([1,2,3,1], [1,3,2,2])
    # SpearmanrResult(correlation=0.5000000000000001, pvalue=0.4999999999999999)
    
    # Correct integer order
    scipy.stats.spearmanr([3,1,2,3], [3,2,1,1])
    # SpearmanrResult(correlation=0.05555555555555556, pvalue=0.9444444444444444)
    
    # String codes whose alphabetical order matches the intended order (correct)
    scipy.stats.spearmanr(['c','a','b','c'], ['c','b','a','a'])
    # SpearmanrResult(correlation=0.05555555555555556, pvalue=0.9444444444444444)
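
Rather than typing the integer codes by hand, the encoding can be derived from an explicit mapping. The following is a minimal sketch, assuming the intended orders are never < sometimes < always and low < medium < high; the dictionary and variable names are hypothetical and only serve the illustration.

```python
from scipy import stats

# Assumed category orders (not given by the data itself); names are hypothetical.
FREQUENCY_ORDER = {"never": 1, "sometimes": 2, "always": 3}
LEVEL_ORDER = {"low": 1, "medium": 2, "high": 3}

frequency = ["always", "never", "sometimes", "always"]
level = ["high", "medium", "low", "low"]

# Translate each label into its position in the intended order.
# An unknown label raises KeyError, which surfaces typos early.
frequency_codes = [FREQUENCY_ORDER[label] for label in frequency]
level_codes = [LEVEL_ORDER[label] for label in level]

# The result unpacks as (correlation, pvalue).
rho, p_value = stats.spearmanr(frequency_codes, level_codes)
print(rho, p_value)
```

Keeping the mapping in one place makes the assumed order visible and easy to change. Only the relative order of the codes matters for a rank correlation, so 1, 2, 3 could just as well be 10, 20, 30.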