Tags: python, pandas, statistics, correlation

How to calculate correlation between binary variables in Python?


Two binary variables (x and y) form two columns, indexed by date, in a pandas DataFrame. I want to calculate a correlation score between x and y that quantifies how strongly x=1 is associated with y=1 (and x=0 with y=0).

  1. What definition of correlation is appropriate?

  2. Is there a built-in function?

    day  x  y
    0    1  1
    1    1  0
    2    0  0
    3    1  1

Explanation: These are two categorical variables, say, x = had eggs for breakfast (0 or 1) and y = got a headache (0 or 1), and there is data from several days for both x and y. I'm trying to see how 'strongly correlated' having eggs and having a headache are. I understand that Pearson's correlation is not applicable here. What could be used?


Solution

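  • The correlation metric to use in this case is the phi coefficient, also known as the Matthews correlation coefficient. It is defined for two binary variables and is numerically equal to Pearson's correlation coefficient computed on the 0/1 values, so Pearson's correlation does apply here.

    phi = (n11*n00 - n10*n01) / sqrt(n1_ * n0_ * n_1 * n_0)
    where
    n11 (n00) = number of rows with x=1 (0) and y=1 (0), etc.
    n1_ = n11 + n10 and n0_ = n01 + n00 are the totals for x=1 and x=0
    n_1 = n11 + n01 and n_0 = n10 + n00 are the totals for y=1 and y=0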
    

    https://en.wikipedia.org/wiki/Phi_coefficient
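
As a minimal sketch of the phi coefficient linked above, the snippet below counts the four cells of the 2x2 table for the sample data from the question and divides by the square root of the product of the marginal totals, as in the standard phi definition. The helper name `phi_coefficient` is hypothetical, not a pandas function. The built-in `Series.corr` (Pearson by default) should give the same value on 0/1 columns, which answers the second question.

```python
import numpy as np
import pandas as pd

# Sample data from the question (day is the index).
df = pd.DataFrame({"x": [1, 1, 0, 1], "y": [1, 0, 0, 1]})


def phi_coefficient(x, y):
    """Phi coefficient of two binary sequences (hypothetical helper)."""
    x = np.asarray(x).astype(bool)
    y = np.asarray(y).astype(bool)

    # Cell counts of the 2x2 contingency table.
    n11 = int(np.sum(x & y))
    n10 = int(np.sum(x & ~y))
    n01 = int(np.sum(~x & y))
    n00 = int(np.sum(~x & ~y))

    # Product of the marginal totals (x=1, x=0, y=1, y=0).
    denom = np.sqrt((n11 + n10) * (n01 + n00) * (n11 + n01) * (n10 + n00))
    if denom == 0:
        # Undefined when either variable is constant.
        return float("nan")
    return (n11 * n00 - n10 * n01) / denom


phi = phi_coefficient(df["x"], df["y"])

# Built-in alternative: Pearson correlation on the 0/1 columns.
pearson = df["x"].corr(df["y"])

print(phi, pearson)
```

For a full pairwise matrix over several binary columns, `df.corr()` applies the same Pearson calculation to every pair of columns.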