Search code examples
pandasdataframedata-sciencecorrelationpearson-correlation

Why is the correlation one when values differ?


I have a dataframe book_matrix with users as rows, books as columns, and ratings as values. When I use corrwith() to compute the correlation between 'The Lord of the Rings' and 'The Silmarillion' the result is 1.0, but the values are clearly different.

enter image description here

The non-null values [10, 3] and [10, 9] have correlation 1.0. I would expect them to be exactly the same when the correlation is equal to one. How can this happen?


Solution

  • Correlation means the values have a certain relationship with one another, for example linear combination of factors. Here's an illustration:

    import pandas as pd
      
    df1 = pd.DataFrame({"A":[1, 2, 3, 4], 
                        "B":[5, 8, 4, 3],
                        "C":[10, 4, 9, 3]})
      
    df2 = pd.DataFrame({"A":[2, 4, 6, 8],
                        "B":[-5, -8, -4, -3],
                        "C":[4, 3, 8, 5]})
    
    df1.corrwith(df2, axis=0)
    
    A    1.000000
    B   -1.000000
    C    0.395437
    dtype: float64
    

    So you can see that [1, 2, 3, 4] and [2, 4, 6, 8] have correlation 1.0

    The next column [5, 8, 4, 3] and [-5, -8, -4, -3] have extreme negative correlation -1.0

    In the last column, [10, 4, 9, 3] and [4, 3, 8, 5] are somewhat correlated 0.395437, because both exhibits high-low-high-low sequence but with varying vertical scaling factors.

    So in your case both books 'The Lord of the Rings' and 'The Silmarillion' only has 2 ratings each, and both ratings are having high-low sequence. Even if I illustrate with more data points, they have the same vertical scaling factor.

    df1 = pd.DataFrame({"A": [10, 3, 10, 3, 10, 3],
                        "B": [10, 3, 10, 3, 10, 3]})
    df2 = pd.DataFrame({"A": [10, 9, 10, 9, 10, 9],
                        "B": [10, 10, 10, 9, 9, 9]})
    
    df1.corrwith(df2, axis=0)
    
    A    1.000000
    B    0.333333
    dtype: float64
    

    So you can see that [10, 3, 10, 3, 10, 3] and [10, 9, 10, 9, 10, 9] are also correlated perfectly at 1.0.

    But if I rearrange the sequence a little, [10, 3, 10, 3, 10, 3] and [10, 10, 10, 9, 9, 9] are not perfectly correlated anymore at 0.333333

    So going forward, you need more data, and more variations in the data! Hope that helps 😎