Using prob package to calculate a conditional probability in R


My data looks like this:

d

#> # A tibble: 220 x 2
#>    smoker pain 
#>    <chr>  <chr>
#>  1 Smoker Pain 
#>  2 Smoker Pain 
#>  3 Smoker Pain 
#>  4 Smoker Pain 
#>  5 Smoker Pain 
#>  6 Smoker Pain 
#>  7 Smoker Pain 
#>  8 Smoker Pain 
#>  9 Smoker Pain 
#> 10 Smoker Pain 
#> # … with 210 more rows

It is a cross-classification of two variables, smoker and pain:

d %>% 
  count(smoker, pain, sort = TRUE)
#> # A tibble: 4 x 3
#>   smoker    pain        n
#>   <chr>     <chr>   <int>
#> 1 No smoker No pain   107
#> 2 Smoker    Pain       70
#> 3 Smoker    No pain    35
#> 4 No smoker Pain        8

I want to calculate the probability that a person feels pain given that they are a smoker, P(pain | smoker):

library(tidyverse)
library(prob)

d <- probspace(d)
Prob(d, event = smoker == "Smoker", given = pain == "Pain")
#> [1] 0.01282051

As far as I know, this value should be the proportion of smokers who feel pain:

70/105

#> [1] 0.667

What is wrong here?

This is the code for the data:

smoker <- c(rep("Smoker", 105), rep("No smoker", 115))
pain <- c(rep("Pain", 70), rep("No pain", 35), rep("Pain", 8), rep("No pain", 107))

d <- tibble(smoker, pain)
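
For reference, the expected value can be checked directly from these two vectors without the prob package. The following is a minimal base-R sketch; the names p_pain_given_smoker and tab are introduced here for illustration only.

```r
# Same vectors as in the question
smoker <- c(rep("Smoker", 105), rep("No smoker", 115))
pain <- c(rep("Pain", 70), rep("No pain", 35), rep("Pain", 8), rep("No pain", 107))

# P(pain | smoker): restrict to smokers, then take the share reporting pain
p_pain_given_smoker <- mean(pain[smoker == "Smoker"] == "Pain")
p_pain_given_smoker  # 70/105

# Equivalent view: row-wise proportions of the 2 x 2 contingency table
tab <- prop.table(table(smoker, pain), margin = 1)
tab["Smoker", "Pain"]  # same value, read from the "Smoker" row
```

Both expressions condition on the smoker rows first, which is what P(pain | smoker) asks for.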

Solution

  • I think you should add one more line, d <- cbind(id = seq(nrow(d)), d), after d <- tibble(smoker, pain), i.e.,

    d <- tibble(smoker, pain)
    d <- cbind(id = seq(nrow(d)),d)
    

    Then, with event and given ordered to match P(pain | smoker), you get the desired result:

    > Prob(d, event = pain == "Pain", given = smoker == "Smoker")
    [1] 0.6666667
    

    NOTE: The reason for doing this is that Prob() computes the intersect() of the event and the given condition. When the probability space is a data frame, duplicate rows in that intersection are dropped. To avoid this, you need to add a column that distinguishes the rows of d (here, id), so that every duplicate outcome is kept until the end of the calculation.
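
The effect described in the note can be reproduced in base R alone. The sketch below is a simplified imitation of a set-based conditional probability, not the prob package's actual source; cond_prob_by_set is a hypothetical helper, and it assumes rows are compared by pasting all their columns into one key.

```r
smoker <- c(rep("Smoker", 105), rep("No smoker", 115))
pain <- c(rep("Pain", 70), rep("No pain", 35), rep("Pain", 8), rep("No pain", 107))
d <- data.frame(smoker, pain)
d$probs <- 1 / nrow(d)  # equally likely rows, as in a uniform probability space

# Hypothetical helper: P(event | given) via a set intersection of rows.
# `event` and `given` are logical vectors with one entry per row of `space`.
cond_prob_by_set <- function(space, event, given) {
  A <- space[event, ]
  B <- space[given, ]
  # Build one string key per row from all columns
  key_a <- do.call(paste, c(A, sep = "\r"))
  key_b <- do.call(paste, c(B, sep = "\r"))
  # base::intersect() treats its inputs as sets, so repeated keys collapse
  common <- intersect(key_a, key_b)
  sum(A$probs[match(common, key_a)]) / sum(B$probs)
}

# Without an id column, all 70 "Smoker / Pain" rows share a single key
p_no_id <- cond_prob_by_set(d, d$smoker == "Smoker", d$pain == "Pain")
p_no_id  # 1/78, about 0.0128, the value reported in the question

# With an id column every row is unique, so nothing collapses
d_id <- cbind(id = seq_len(nrow(d)), d)
p_with_id <- cond_prob_by_set(d_id, d_id$pain == "Pain", d_id$smoker == "Smoker")
p_with_id  # 70/105, about 0.667
```

In the first call the intersection keeps only one of the 70 identical rows, so the numerator is 1/220 rather than 70/220. The second call also puts pain in the event position, which is what P(pain | smoker) requires.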