I have the following dataframe
{'state': {7192: 'healthy',
7193: 'healthy',
7194: 'healthy',
7195: 'Non healthy',
7196: 'Non healthy'},
'type': {7192: 'W', 7193: 'A', 7194: 'W', 7195: 'W', 7196: 'A'}}
I would like to have the joint probability associated with this df.
P(State = healthy, type = A), P(State = healthy, type = W), P(State = Non healthy, type = A), P(State = Non healthy, type = W)
I tried the groupby method but it didn't work. What is the most efficient way to do this?
EDIT: To clarify, I want to count the occurrences of every (State, Type) pair and divide by the total number of rows. In the example above this should give P(State = healthy, type = A) = 1/5, P(State = healthy, type = W) = 2/5, P(State = Non healthy, type = A) = 1/5, P(State = Non healthy, type = W) = 1/5
Thank you,
It seems you can use DataFrame.value_counts(normalize=True) to achieve what you want. Note that DataFrame.value_counts was added in pandas 1.1.0. If you're using an older version, you can achieve the same result with a different method.
First, convert your dictionary to a pd.DataFrame:
import pandas as pd
df = pd.DataFrame(data)  # data is the dictionary from the question
Pandas version >= 1.1.0
probs = df.value_counts(["state", "type"], normalize=True)
print(probs)
healthy      W    0.4
             A    0.2
Non healthy  W    0.2
             A    0.2
# Select an individual probability:
healthy_a_prob = probs[("healthy", "A")]
print(healthy_a_prob)
0.2
If your pandas is older than 1.1.0, replace the probs = df.value_counts(...) line in the above example with:
probs = df.groupby("state")["type"].value_counts() / len(df)
# rest is the exact same
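As a quick sanity check of the groupby approach, here is a minimal self-contained sketch using the data from the question. The variable names (probs_groupby, probs_size, joint) are my own, and the groupby(...).size() variant is just an alternative I'm adding for comparison, not part of the original suggestion.

```python
import pandas as pd

# Data from the question.
data = {
    "state": {7192: "healthy", 7193: "healthy", 7194: "healthy",
              7195: "Non healthy", 7196: "Non healthy"},
    "type": {7192: "W", 7193: "A", 7194: "W", 7195: "W", 7196: "A"},
}
df = pd.DataFrame(data)

# Count each (state, type) pair, then divide by the total number of rows
# to turn the counts into joint probabilities.
probs_groupby = df.groupby("state")["type"].value_counts() / len(df)

# Alternative (an assumption of mine, not from the answer above):
# group on both columns at once and use size() for the counts.
probs_size = df.groupby(["state", "type"]).size() / len(df)

# Convert to a plain dict keyed by (state, type) tuples for easy lookup.
joint = probs_groupby.to_dict()
print(joint[("healthy", "W")])
```

Both variants divide by len(df) rather than normalizing within each group, so the values are joint probabilities and not conditional ones.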
If you want a cross-tabulated probability table, I would recommend using pd.crosstab with normalize=True:
crosstab_ptable = pd.crosstab(df["state"], df["type"], normalize=True)
print(crosstab_ptable)
type           A    W
state
Non healthy  0.2  0.2
healthy      0.2  0.4
If you're interested in marginal probabilities as well, you can use the margins argument:
crosstab_ptable = pd.crosstab(df["state"], df["type"], margins=True, normalize=True)
print(crosstab_ptable)
type           A    W  All
state
Non healthy  0.2  0.2  0.4
healthy      0.2  0.4  0.6
All          0.4  0.6  1.0
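To pull a single value out of the crosstab, label-based indexing with .loc should work. The sketch below rebuilds the question's data as plain lists (so the original index labels are dropped, which doesn't affect the probabilities), and the variable names are only illustrative.

```python
import pandas as pd

# Same rows as in the question, without the original index labels.
df = pd.DataFrame({
    "state": ["healthy", "healthy", "healthy", "Non healthy", "Non healthy"],
    "type": ["W", "A", "W", "W", "A"],
})

table = pd.crosstab(df["state"], df["type"], margins=True, normalize=True)

# Joint probability P(state = healthy, type = W): row label, then column label.
p_healthy_w = table.loc["healthy", "W"]

# Marginal probability P(state = healthy), read from the "All" column.
p_healthy = table.loc["healthy", "All"]

# Marginal probability P(type = A), read from the "All" row.
p_type_a = table.loc["All", "A"]

print(p_healthy_w, p_healthy, p_type_a)
```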