python pandas dataframe filtering probability

Get joint probability from a pandas DataFrame


I have the following dataframe:

{'state': {7192: 'healthy',
  7193: 'healthy',
  7194: 'healthy',
  7195: 'Non healthy',
  7196: 'Non healthy'},
 'type': {7192: 'W', 7193: 'A', 7194: 'W', 7195: 'W', 7196: 'A'}}

I would like to compute the joint probabilities associated with this df:

P(state = healthy, type = A), P(state = healthy, type = W), P(state = Non healthy, type = A), P(state = Non healthy, type = W)

I tried the groupby method, but it didn't work. What is the most efficient way to do this?

EDIT: To clarify, I want to count the occurrences of every (state, type) pair as a fraction of all rows. In the example above this should give P(state = healthy, type = A) = 1/5, P(state = healthy, type = W) = 2/5, P(state = Non healthy, type = A) = 1/5, and P(state = Non healthy, type = W) = 1/5.

Thank you,


Solution

  • It looks like you can use DataFrame.value_counts(normalize=True) to get what you want. Note that DataFrame.value_counts was added in pandas 1.1.0. If you're using an older version, you can get the same result with groupby, as shown further down.

    First, convert your dictionary to a pd.DataFrame:

    import pandas as pd
    df = pd.DataFrame(data)  # data is the dictionary shown in the question
    

    Pandas version >= 1.1.0

    probs = df.value_counts(["state", "type"], normalize=True)
    
    print(probs)
    healthy      W       0.4
                 A       0.2
    Non healthy  W       0.2
                 A       0.2
    
    # Select an individual probability:
    healthy_a_prob = probs[("healthy", "A")]
    
    print(healthy_a_prob)
    0.2
    

    If your pandas version is older than 1.1.0, replace the first line of the example above with:

    probs = df.groupby("state")["type"].value_counts() / len(df)
    
    # the rest is exactly the same
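
As a quick sanity check, the two approaches above can be compared on the data from the question. This is only a sketch: the names probs_new, probs_old, pairs and joint are illustrative, and it assumes pandas >= 1.1.0 so that both lines can run side by side.

```python
import pandas as pd

# Data copied from the question.
data = {
    "state": {7192: "healthy", 7193: "healthy", 7194: "healthy",
              7195: "Non healthy", 7196: "Non healthy"},
    "type": {7192: "W", 7193: "A", 7194: "W", 7195: "W", 7196: "A"},
}
df = pd.DataFrame(data)

# pandas >= 1.1.0: normalized counts of each (state, type) pair.
probs_new = df.value_counts(["state", "type"], normalize=True)

# Older pandas: per-group counts divided by the total number of rows.
probs_old = df.groupby("state")["type"].value_counts() / len(df)

# Both results are indexed by (state, type), so look values up by key
# rather than relying on row order.
pairs = [("healthy", "A"), ("healthy", "W"),
         ("Non healthy", "A"), ("Non healthy", "W")]
joint = {pair: float(probs_new[pair]) for pair in pairs}

print(joint)
```

Dividing by len(df) rather than passing normalize=True to the grouped value_counts matters here: normalizing within each group would give the conditional probability of type given state, not the joint probability.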
    

    If you want a cross-tabulated probability table, I would recommend using pd.crosstab with normalize=True:

    crosstab_ptable = pd.crosstab(df["state"], df["type"], normalize=True)
    
    print(crosstab_ptable)
    type           A    W
    state
    Non healthy  0.2  0.2
    healthy      0.2  0.4
    

    If you're interested in marginal probabilities as well, you can use the margins argument:

    crosstab_ptable = pd.crosstab(df["state"], df["type"], margins=True, normalize=True)
    
    print(crosstab_ptable)
    type           A    W  All
    state
    Non healthy  0.2  0.2  0.4
    healthy      0.2  0.4  0.6
    All          0.4  0.6  1.0
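
To see how the margins relate to the joint table, the sketch below recomputes them by summing the joint probabilities over the other variable and reads a single cell with .loc. The variable names (joint_table, p_state, p_type, with_margins) are illustrative and not part of the original answer.

```python
import pandas as pd

# Data copied from the question.
data = {
    "state": {7192: "healthy", 7193: "healthy", 7194: "healthy",
              7195: "Non healthy", 7196: "Non healthy"},
    "type": {7192: "W", 7193: "A", 7194: "W", 7195: "W", 7196: "A"},
}
df = pd.DataFrame(data)

# Joint probability table: rows are states, columns are types.
joint_table = pd.crosstab(df["state"], df["type"], normalize=True)

# Marginals are sums of the joint probabilities over the other variable.
p_state = joint_table.sum(axis=1)  # P(state)
p_type = joint_table.sum(axis=0)   # P(type)

# A single joint probability, addressed as (row label, column label).
p_healthy_w = joint_table.loc["healthy", "W"]

# The same marginals appear in the "All" row and column when margins=True.
with_margins = pd.crosstab(df["state"], df["type"], margins=True, normalize=True)

print(p_state)
print(p_type)
print(p_healthy_w)
```

Summing by hand is mostly useful when you already have the joint table and want only one of the marginals; margins=True is the shorter route when you want everything in one table.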