python pandas statistics data-analysis covariance

Covariance of two columns of a dataframe


Please forgive this question if it sounds too trivial, but I want to be sure I'm on the right track. I have a data frame similar to the following, and I'm interested in understanding whether the two variables A and B vary together or otherwise.

       A       B
0   34.4534 35.444248
1   34.8915 24.693800
2   0.0000  21.586316
3   34.7767 23.783602

I am asked to plot a covariance between the two. However, from my research, it seems covariance is a single computed value, like the mean or standard deviation, not a distribution like a PDF/CDF that one can plot.

Is my perception about covariance right? What advice could you give me for some other way to understand the variability between these variables?


Solution

  • Is your perception right? - Yes

    Covariance is a measure of the joint variability of two random variables and is represented by one number. This number is

    • positive if they "behave similarly" (roughly, positive peaks in variable 1 coincide with positive peaks in variable 2)
    • zero if they do not covary
    • negative if they "behave similarly" but with an inverse relationship (that is, negative peaks align with positive peaks and vice versa)

    import numpy as np
    import pandas as pd

    # create 3 random variables; var 3 is based on var 1, so they should covary
    # (cast to float so the 0.5 factor below is not truncated on assignment)
    data = np.random.randint(-9, 9, size=(20, 3)).astype(float)
    data[:, 2] = data[:, 0] + data[:, 2] * 0.5

    df = pd.DataFrame(data, columns=['var1', 'var2', 'var3'])
    df.plot(marker='.')  # plotting requires matplotlib
    
    


    The plot suggests that var1 and var3 covary. To compute the covariance between all pairs of variables, pandas comes in handy:

    >>> df.cov()
    
              var1       var2       var3
    var1  31.326316  -5.389474  30.684211
    var2  -5.389474  21.502632 -10.907895
    var3  30.684211 -10.907895  37.776316
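For just two columns, as in the question, the same idea can be sketched with `Series.cov`. The frame below copies the four rows shown in the question; the names `cov_ab` and `cov_manual` are only illustrative. The manual computation assumes the sample estimator (division by n - 1), which is what pandas uses by default.

```python
import pandas as pd

# Values copied from the example frame in the question.
df = pd.DataFrame({
    "A": [34.4534, 34.8915, 0.0000, 34.7767],
    "B": [35.444248, 24.693800, 21.586316, 23.783602],
})

# Covariance between exactly two columns: a single number.
cov_ab = df["A"].cov(df["B"])

# The same value by hand: sum of products of deviations from the mean,
# divided by n - 1 (sample covariance).
dev_a = df["A"] - df["A"].mean()
dev_b = df["B"] - df["B"].mean()
cov_manual = (dev_a * dev_b).sum() / (len(df) - 1)

print(cov_ab, cov_manual)
```

Both numbers should agree, and they match the off-diagonal entry of `df.cov()` for this two-column frame.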
    
    

    Since the actual covariance values depend on the scale of your input variables, you typically normalize each covariance by the product of the two standard deviations. This gives the (Pearson) correlation coefficient, a scale-free measure of linear association ranging from -1 (perfectly anticorrelated) to 1 (perfectly correlated). With pandas, this reads

    >>> df.corr()
              var1      var2      var3
    var1  1.000000 -0.207657  0.891971
    var2 -0.207657  1.000000 -0.382724
    var3  0.891971 -0.382724  1.000000
    
    

    from which it becomes clear that var1 and var3 exhibit a strong correlation, exactly as expected.
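To make the normalization step concrete, here is a minimal sketch that rebuilds the correlation matrix from the covariance matrix. It regenerates data in the same way as above, but with a fixed seed (an arbitrary choice made here for repeatability), so the numbers will differ from the output shown.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)  # arbitrary seed, only for repeatability
data = rng.integers(-9, 9, size=(20, 3)).astype(float)
data[:, 2] = data[:, 0] + 0.5 * data[:, 2]
df = pd.DataFrame(data, columns=["var1", "var2", "var3"])

cov = df.cov()
std = df.std()  # sample standard deviation (ddof=1), consistent with df.cov()

# corr[i, j] = cov[i, j] / (std[i] * std[j])
corr_manual = cov / np.outer(std, std)

print(corr_manual)
print(df.corr())
```

The two printed matrices should be identical up to floating-point error, with ones on the diagonal.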


  • What advice could you give me for some other way to understand the variability between these variables? - Depends on the data

    Since we don't know anything about the nature of your data, this is hard to say. Perhaps just as a starter (without intending to be exhaustive), some hints at what you could look at:

    • Spearman's rank correlation: more robust than the Pearson correlation coefficient used above. Pearson only measures linear association and understates the strength of a relationship that is non-linear. Spearman works on ranks, so it captures any monotonic relationship and is less sensitive to outliers. If your data may be related non-linearly but monotonically, go for Spearman.
    • Cross-correlation (lagged correlation): think about a sinusoidal signal that triggers another signal with a phase lag of 90° (a sine and a cosine). In that case, the ordinary covariance/correlation at zero lag is close to zero, which may (falsely) lead you to conclude that there is no causal effect between the two signals. Cross-correlation is the correlation between one series and shifted versions of the other, which lets you detect lagged relationships. Autocorrelation is the same idea applied to a single series and its own shifted copies.
    • probably much more, but perhaps that's good for a start
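Both hints can be illustrated on synthetic data. This is only a sketch under assumptions made here: the exponential relationship, the period of 40 samples, and all variable names are invented for illustration. Spearman is computed as Pearson on ranks, which is its definition and avoids extra dependencies.

```python
import numpy as np
import pandas as pd

# Monotonic but non-linear relationship: y grows exponentially with x.
x = pd.Series(np.linspace(0, 5, 50))
y = np.exp(x)
pearson = x.corr(y)                  # linear association only
spearman = x.rank().corr(y.rank())   # Spearman = Pearson on the ranks

# Lagged relationship: a sine and a cosine with the same period.
period = 40                          # samples per cycle (illustrative)
t = np.arange(10 * period)
a = pd.Series(np.sin(2 * np.pi * t / period))
b = pd.Series(np.cos(2 * np.pi * t / period))

zero_lag = a.corr(b)
# Correlate a with b shifted by each candidate lag; shift() pads with NaN,
# and Series.corr drops the incomplete pairs.
lagged = {lag: a.corr(b.shift(lag)) for lag in range(period)}
best_lag = max(lagged, key=lagged.get)

print(pearson, spearman)
print(zero_lag, best_lag, lagged[best_lag])
```

Here the rank correlation reaches 1 while Pearson stays clearly below it, and the sine/cosine pair looks uncorrelated at zero lag but lines up once `b` is shifted by a quarter period.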