Please forgive this question if it sounds too trivial, but I want to be sure I'm on the right track. I have a data frame similar to the following, and I'm interested in understanding whether the two variables A and B vary together or otherwise.
         A          B
0  34.4534  35.444248
1  34.8915  24.693800
2   0.0000  21.586316
3  34.7767  23.783602
I have been asked to plot the covariance between the two. However, from my research, it seems covariance is a single calculated value, just like the mean and standard deviation, not a distribution like a PDF/CDF that one can plot.
Is my understanding of covariance right? What other ways would you suggest for understanding how these variables vary together?
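Covariance is a measure of the joint variability of two random variables and is represented by one number. This number is positive when the two variables tend to move in the same direction, negative when they tend to move in opposite directions, and close to zero when there is no linear relationship between them.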
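import numpy as np
import pandas as pd

# create 3 random integer variables; var3 is built from var1, so the two should covary
data = np.random.randint(-9, 9, size=(20, 3))
# data is an integer array, so the fractional part of the sum is truncated on assignment
data[:, 2] = data[:, 0] + data[:, 2] * 0.5
df = pd.DataFrame(data, columns=['var1', 'var2', 'var3'])
df.plot(marker='.')  # line plot of all three variables (requires matplotlib)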
We see that var1 and var3 seem to covary; so in order to compute the covariance between all variables, pandas comes in handy:
>>> df.cov()
           var1       var2       var3
var1  31.326316  -5.389474  30.684211
var2  -5.389474  21.502632 -10.907895
var3  30.684211 -10.907895  37.776316
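To make concrete what each entry of that matrix is, here is a minimal sketch that computes the sample covariance by hand and compares it with pandas. The frame `df_demo`, its values, and the helper `sample_cov` are made up for illustration and are not part of the example above; the sketch assumes the default pandas behaviour of normalizing by N - 1.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data, chosen so the arithmetic is easy to follow
df_demo = pd.DataFrame({
    "x": [1.0, 2.0, 3.0, 4.0, 5.0],
    "y": [2.0, 4.0, 5.0, 4.0, 5.0],
})

def sample_cov(a, b):
    """Sample covariance: sum of products of deviations from the mean, divided by N - 1."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return ((a - a.mean()) * (b - b.mean())).sum() / (len(a) - 1)

manual = sample_cov(df_demo["x"], df_demo["y"])
from_pandas = df_demo.cov().loc["x", "y"]

print(manual, from_pandas)  # both are 1.5 for this toy data
```

The diagonal of the covariance matrix is the same formula applied to a variable with itself, which is simply its variance.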
The actual values of the covariance depend on the scale of your input variables, so you typically normalize the covariance by the respective standard deviations. This gives you the correlation coefficient, a scale-free measure of linear association ranging from -1 (perfectly anticorrelated) to 1 (perfectly correlated). With pandas, this reads
>>> df.corr()
          var1      var2      var3
var1  1.000000 -0.207657  0.891971
var2 -0.207657  1.000000 -0.382724
var3  0.891971 -0.382724  1.000000
from which it becomes clear that var1 and var3 exhibit a strong correlation, exactly as we expect.
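The normalization described above can be checked by hand. The following sketch uses the four rows of A and B shown in the question (far too few to conclude anything about the real data) and assumes the pandas defaults, i.e. Pearson correlation and N - 1 normalization. The names `ab`, `corr_manual` and `ab_scaled` are only illustrative.

```python
import numpy as np
import pandas as pd

# The four sample rows from the question
ab = pd.DataFrame({
    "A": [34.4534, 34.8915, 0.0000, 34.7767],
    "B": [35.444248, 24.693800, 21.586316, 23.783602],
})

cov = ab.cov()
std = ab.std()  # sample standard deviations (N - 1), consistent with cov()

# Divide each covariance by the product of the two standard deviations
corr_manual = cov / np.outer(std, std)
print(np.allclose(corr_manual, ab.corr()))  # True

# Rescaling A (e.g. a change of units) changes the covariance but not the correlation
ab_scaled = ab.assign(A=ab["A"] * 100)
print(ab_scaled.cov().loc["A", "B"] / cov.loc["A", "B"])  # approximately 100
print(np.allclose(ab_scaled.corr(), ab.corr()))           # True
```

This is why the correlation is usually the easier number to interpret when the variables are measured in different units.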
Since we don't know anything about the nature of your data, this is hard to say. Perhaps just as a starter (without intending to be exhaustive), some hints at what you could look at: