python pandas frequency-analysis frequency-distribution

How to create a Frequency Distribution Matrix from a Pandas DataFrame of boolean values


In short, I'm trying to translate a DataFrame like this

Patient   Cough   Headache   Dizzy
   1        1         0        0 
   2        1         1        1
   3        0         1        0 
   4        1         0        1
   5        0         1        0 

into a frequency distribution matrix, similar to the output of Pandas' correlation feature (DataFrame.corr).

That is to say, it would return something like this

        Cough   Headache   Dizzy
Cough     1       0.33     0.66
Headache 0.33       1      0.33
Dizzy     1       0.5       1

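because 1 in 3 people with a Headache were Dizzy, while 1 in 2 people who were Dizzy had a Headache, and so on. Each row is the group being counted, and each column is the share of that group with the other symptom.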

The actual data I want to use it on is a lot bigger, so I was just curious if Pandas has a way to do this automatically.


Solution

  • Something like this?

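    # keep only the symptom columns (drop Patient)
    s = df.iloc[:, 1:]
    
    # s.T @ s counts co-occurrences; dividing by s.sum() and transposing
    # makes entry [row, col] the share of `row` patients who also have `col`
    ((s.T @ s) / s.sum()).T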
    

    Output:

                 Cough  Headache     Dizzy
    Cough     1.000000  0.333333  0.666667
    Headache  0.333333  1.000000  0.333333
    Dizzy     1.000000  0.500000  1.000000
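
For anyone who wants to reproduce this end to end, here is a minimal self-contained sketch of the same idea. It rebuilds the sample DataFrame from the question and wraps the calculation in a helper; the name `conditional_frequencies` is made up for illustration and is not part of pandas. It divides row-wise with `DataFrame.div(..., axis=0)` rather than transposing twice, which should give the same matrix.

```python
import pandas as pd

# Sample data from the question; Patient is an identifier, not a symptom.
df = pd.DataFrame({
    "Patient":  [1, 2, 3, 4, 5],
    "Cough":    [1, 1, 0, 1, 0],
    "Headache": [0, 1, 1, 0, 1],
    "Dizzy":    [0, 1, 0, 1, 0],
})

def conditional_frequencies(flags: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical helper: entry [row, col] is the share of records
    flagged `row` that are also flagged `col`."""
    flags = flags.astype(int)        # lets True/False columns work too
    co_counts = flags.T @ flags      # co_counts[a, b] = records with both a and b
    totals = flags.sum()             # records flagged per column
    return co_counts.div(totals, axis=0)  # divide each row by its own total

result = conditional_frequencies(df.drop(columns="Patient"))
print(result.round(2))
```

Dropping the identifier by name with `drop(columns="Patient")` is just one option; `df.iloc[:, 1:]` as above works equally well when the identifier is the first column. Note that a symptom nobody has would produce a 0/0 division, so its row would come out as NaN.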