
Calculating Conditional Probabilities from frequencies in Python


I am trying to calculate the conditional probabilities for P(A=a|B=b,C=c) where a is an element in ['high', 'medium', 'low'], b is an element in ['0-20', '20-40', '40-60', '60-80', '80-inf'] and c is an element in ['male', 'female'].

I have a dictionary with the frequencies that looks like this:

{('high', '0-20', 'female'): 11,
 ('high', '0-20', 'male'): 43,
 ('high', '20-40', 'female'): 10,
 ('high', '20-40', 'male'): 17,
 ('high', '40-60', 'female'): 11,
 ('high', '40-60', 'male'): 10,
 ('high', '60-80', 'female'): 2,
 ('high', '60-80', 'male'): 1,
 ('high', '80-inf', 'female'): 0,
 ('high', '80-inf', 'male'): 0,
 ('low', '0-20', 'female'): 130,
 ('low', '0-20', 'male'): 159,
 ('low', '20-40', 'female'): 186,
 ('low', '20-40', 'male'): 297,
 ('low', '40-60', 'female'): 71,
 ('low', '40-60', 'male'): 144,
 ('low', '60-80', 'female'): 35,
 ('low', '60-80', 'male'): 53,
 ('low', '80-inf', 'female'): 1,
 ('low', '80-inf', 'male'): 2,
 ('medium', '0-20', 'female'): 90,
 ('medium', '0-20', 'male'): 194,
 ('medium', '20-40', 'female'): 72,
 ('medium', '20-40', 'male'): 116,
 ('medium', '40-60', 'female'): 46,
 ('medium', '40-60', 'male'): 49,
 ('medium', '60-80', 'female'): 12,
 ('medium', '60-80', 'male'): 22,
 ('medium', '80-inf', 'female'): 1,
 ('medium', '80-inf', 'male'): 2}

What I want is a dictionary that looks like:

{('high', '0-20', 'female'): P(A='high'| B='0-20', C='female'),
 etc...,
}

Solution

  • So, if I'm understanding your comment correctly, what you are having trouble with is the concept of calculating the conditional probability when there are two or more "conditions" as opposed to a single condition.

    It's been quite a while since I last took a probability/statistics class, but I think the definition of conditional probability covers this directly if you treat the pair of conditions as a single event X = (B=b AND C=c). The joint probability P(X) comes straight from the data: e.g. P(B='0-20', C='female') is the sum of the counts of all entries that match both conditions (over every value of A) divided by the total count. Likewise, P(A=a ∩ X) is the count for (a, b, c) divided by the total count. Then, from the definition of conditional probability, P(A=a|X) = P(A=a ∩ X) / P(X). The total count cancels, so this is just the count for (a, b, c) divided by the sum of the counts for (a', b, c) over all values a' of A; e.g. P(A='high' | B='0-20', C='female') = 11 / (11 + 130 + 90) = 11/231 ≈ 0.048.

    It might be a good idea to repost this or migrate it to the Math SE site, though, to get confirmation and/or a better answer...
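
For what it's worth, here is a minimal sketch of how that calculation could look in Python, assuming the frequencies are stored in a dictionary keyed by (a, b, c) tuples as in the question. The function name `conditional_probabilities` and the choice to return `None` for a condition with no observations are my own assumptions, not something from the question, and `freq` below is only a small subset of the posted data.

```python
from collections import defaultdict


def conditional_probabilities(freq):
    """Return a dict mapping (a, b, c) to P(A=a | B=b, C=c).

    Hypothetical helper: freq maps (a, b, c) tuples to observed counts.
    """
    # Total count for each (b, c) condition, summed over every value of a.
    totals = defaultdict(int)
    for (a, b, c), count in freq.items():
        totals[(b, c)] += count

    # Divide each cell by the total of its condition. A condition that was
    # never observed has no defined conditional probability, so it is marked
    # with None (an assumption; raising an error would also be reasonable).
    return {
        (a, b, c): (count / totals[(b, c)] if totals[(b, c)] else None)
        for (a, b, c), count in freq.items()
    }


# Small subset of the frequencies from the question, for illustration only.
freq = {
    ('high', '0-20', 'female'): 11,
    ('low', '0-20', 'female'): 130,
    ('medium', '0-20', 'female'): 90,
    ('high', '80-inf', 'male'): 0,
    ('low', '80-inf', 'male'): 2,
    ('medium', '80-inf', 'male'): 2,
}

probs = conditional_probabilities(freq)
print(probs[('high', '0-20', 'female')])  # 11 / 231
print(probs[('low', '80-inf', 'male')])   # 2 / 4
```

For each fixed (b, c), the probabilities over all values of a should sum to 1, which is an easy sanity check on the result.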