Search code examples
rprobability

Conditional probability in r


The question:

A screening test for a disease, that affects 0.05% of the male population, is able to identify the disease in 90% of the cases where an individual actually has the disease. The test however generates 1% false positives (gives a positive reading when the individual does not have the disease). Find the probability that a man has the disease given that has tested positive. Then, find the probability that a man has the disease given that he has a negative test.

My wrong attempt:

I first started by letting: • T be the event that a man has a positive test • Tc be the event that a man has a negative test • D be the event that a man has actually the disease • Dc be the event that a man does not have the disease

Therefore we need to find P(D|T) and P(D|Tc)

Then I wrote this code:

set.seed(110)
sims = 1000

D = rep(0, sims)
Dc = rep(0, sims)
T = rep(0, sims)
Tc = rep(0, sims)

# run the loop
for(i in 1:sims){
  
  # flip to see if we have the disease
  flip = runif(1)
  
  # if we got the disease, mark it
  if(flip <= .0005){
    D[i] = 1
  }
  
  # if we have the disease, we need to flip for T and Tc, 
  if(D[i] == 1){
    
    # flip for S1
    flip1 = runif(1)
    
    # see if we got S1
    if(flip1 < 1/9){
      T[i] = 1
    }
    
    # flip for S2
    flip2 = runif(1)
    
    # see if we got S1
    if(flip2 < 1/10){
      Tc[i] = 1
    }
  }
}


# P(D|T)
mean(D[T == 1])

# P(D|Tc)
mean(D[Tc == 1])

I'm really struggling so any help would be appreciated!


Solution

  • Perhaps the best way to think through a conditional probability question like this is with a concrete example.

    Say we tested one million individuals in the population. Then 500 individuals (0.05% of one million) would be expected to have the disease, of whom 450 would be expected to test positive and 50 to test negative (since the false negative rate is 10%).

    Conversely, 999,500 would be expected to not have the disease (one million minus the 500 who do have the disease), but since 1% of them would test positive, then we would expect 9,995 people (1% of 999,500) with false positive results.

    So, given a positive test result taken at random, it either belongs to one of the 450 people with the disease who tested positive, or one of the 9,995 people without the disease who tested positive - we don't know which.

    This is the situation in the first question, since we have a positive test result but don't know whether it is a true positive or a false positive. The probability of our subject having the disease given their positive test is the probability that they are one of the 450 true positives out of the 10,445 people with positive results (9995 false positives + 450 true positives). This boils down to the simple calculation 450/10,445 or 0.043, which is 4.3%.

    Similarly, a negative test taken at random either belongs to one of the 989505 (999500 - 9995) people without the disease who tested negative, or one of the 50 people with the disease who tested negative, so the probability of having the disease is 50/989505, or 0.005%.

    I think this question is demonstrating the importance of knowing that disease prevalence needs to be taken into account when interpreting test results, and very little to do with programming, or R. It requires only a calculator (at most).

    If you really wanted to run a simulation in R, you could do:

    set.seed(1) # This makes the sample reproducible
    
    sample_size <- 1000000 # This can be changed to get a larger or smaller sample
    
    # Create a large sample of 1 million "people", using a 1 to denote disease and
    # a 0 to denote no disease, with probabilities of 0.0005 (which is 0.05%) and
    # 0.9995 (which is 99.95%) respectively.
    disease <- sample(x = c(0, 1), 
                      size = sample_size, 
                      replace = TRUE, 
                      prob = c(0.9995, 0.0005))
    
    # Create an empty vector to hold the test results for each person
    test <- numeric(sample_size)
    
    # Simulate the test results of people with the disease, using a 1 to denote
    # a positive test and 0 to denote a negative test. This uses a probability of
    # 0.9 (which is 90%) of having a positive test and 0.1 (which is 10%) of having
    # a negative test. We draw as many samples as we have people with the disease
    # and put them into the "test" vector at the locations corresponding to the
    # people with the disease.
    test[disease == 1] <- sample(x = c(0, 1), 
                                 size = sum(disease), 
                                 replace = TRUE, 
                                 prob = c(0.1, 0.9))
    
    # Now we do the same for people without the disease, simulating their test
    # results, with a 1% probability of a positive test.
    test[disease == 0] <- sample(x = c(0, 1), 
                                 size = 1e6 - sum(disease), 
                                 replace = TRUE, 
                                 prob = c(0.99, 0.01))
    

    Now we have run our simulation, we can just count the true positives, false positives, true negatives and false negatives by creating a contingency table

    contingency_table <- table(disease, test)
    
    contingency_table
    #>        test
    #> disease      0      1
    #>       0 989566   9976
    #>       1     38    420
    

    and get the approximate probability of having the disease given a positive test like this:

    contingency_table[2, 2] / sum(contingency_table[,2])
    #> [1] 0.04040015
    

    and the probability of having the disease given a negative test like this:

    contingency_table[2, 1] / sum(contingency_table[,1])
    #> [1] 3.83992e-05
    

    You'll notice that the probability estimates from sampling are not that accurate because of how small some of the sampling probabilities are. You could simulate a larger sample, but it might take a while for your computer to run it.

    Created on 2021-08-19 by the reprex package (v2.0.0)