Conditional probability experiment in R


Here is my code:

library(dplyr)

rain_vector <- sample(c(0,1), 1000000, replace = T, prob= c(0.2,0.8))

for(el in 1:10){

df <- data.frame(rain = rain_vector )
df <- df %>% mutate(A= if_else(rain == 1, sample(c(0,1), 1, replace = T, prob= c(1/3,2/3)),
                          sample(c(0,1), 1, replace = T, prob= c(2/3,1/3))))

print(NROW(df[df$A==1,]))
print(NROW(df[df$A == 1 & df$rain == 1, ]))
print(NROW(df[df$rain == 1,]))
print("______________")

}

Here is the output:

[1] 0
[1] 0
[1] 800325
[1] "______________"
[1] 800325
[1] 800325
[1] 800325
[1] "______________"
[1] 800325
[1] 800325
[1] 800325
[1] "______________"
[1] 800325
[1] 800325
[1] 800325
[1] "______________"
[1] 800325
[1] 800325
[1] 800325
[1] "______________"
[1] 800325
[1] 800325
[1] 800325
[1] "______________"
[1] 800325
[1] 800325
[1] 800325
[1] "______________"
[1] 1000000
[1] 800325
[1] 800325
[1] "______________"
[1] 1000000
[1] 800325
[1] 800325
[1] "______________"
[1] 1000000
[1] 800325
[1] 800325
[1] "______________"

None of the results make sense to me. Take the last one: A = 1 occurs in every row, although it should occur with probability 2/3 when rain is 1 and 1/3 when rain is 0. Is this something specific to the dplyr package? Any suggestions?


Solution

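  • The problem is that sample(c(0,1), 1, replace = T, prob= c(1/3,2/3)) returns a vector of length 1. Each sample() call is evaluated only once, and if_else() recycles that single value across all rows. As a result, A is constant within the rainy rows and constant within the dry rows, which is why every count in your output is 0, 800325 (all rainy rows), or 1000000 (all rows).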
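
    The recycling can be seen in isolation with a small sketch. It uses base R's ifelse() in place of dplyr's if_else() so that it runs without any packages (both recycle length-1 arguments); the names a_if_rain and a_if_no_rain and the seed are made up for illustration.

    ```r
    set.seed(1)  # arbitrary seed, only for reproducibility

    rain <- c(1, 0, 1, 1, 0, 1)

    # Each sample() call runs once and returns a single value ...
    a_if_rain    <- sample(c(0, 1), 1, prob = c(1/3, 2/3))
    a_if_no_rain <- sample(c(0, 1), 1, prob = c(2/3, 1/3))
    length(a_if_rain)  # 1

    # ... which ifelse() then recycles across every row
    A <- ifelse(rain == 1, a_if_rain, a_if_no_rain)

    # All rainy rows share one value, and all dry rows share one value
    unique(A[rain == 1])
    unique(A[rain == 0])
    ```

    However many rows rain has, only two random draws are ever made.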

    Instead, you could call rowwise() before mutate(). This makes the expression run once per row, so sample() is called separately for each row.

      # rowwise() makes mutate() evaluate the expression once per row
      df <- df %>%
        rowwise() %>%
        mutate(A = if_else(rain == 1, sample(c(0,1), 1, replace = TRUE, prob = c(1/3,2/3)),
                                      sample(c(0,1), 1, replace = TRUE, prob = c(2/3,1/3))))
    

    A faster option is to use base R and draw one sample of the appropriate length for each of the two values of rain, reducing the number of calls to sample() from 1000000 to 2:

    rain_vector <- sample(c(0,1), 1000000, replace = TRUE, prob = c(0.2,0.8))
    
    for(el in 1:10){
    
      # Start with A = 0 everywhere, then fill each group with its own draws
      df <- data.frame(rain = rain_vector, A = numeric(length(rain_vector)))
      # Rainy rows: A = 1 with probability 2/3
      df[rain_vector == 1, "A"] <- sample(c(0,1), sum(rain_vector==1), prob = c(1/3,2/3), replace = TRUE)
      # Dry rows: A = 1 with probability 1/3
      df[rain_vector == 0, "A"] <- sample(c(0,1), sum(rain_vector==0), prob = c(2/3,1/3), replace = TRUE)
    
      print(NROW(df[df$A==1,]))
      print(NROW(df[df$A == 1 & df$rain == 1, ]))
      print(NROW(df[df$rain == 1,]))
      print("______________")
    
    }
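
    If a fully vectorised draw is preferred, one further possibility (not part of the code above, just a sketch) is rbinom(), which accepts a vector of per-row probabilities. The seed and the names n and p_A are assumptions for illustration.

    ```r
    set.seed(42)  # arbitrary seed, only for reproducibility

    n <- 1000000
    rain <- sample(c(0, 1), n, replace = TRUE, prob = c(0.2, 0.8))

    # Per-row probability of A = 1: 2/3 when it rains, 1/3 otherwise
    p_A <- ifelse(rain == 1, 2/3, 1/3)

    # rbinom() is vectorised over prob, so one call gives an independent draw per row
    A <- rbinom(n, size = 1, prob = p_A)

    # Empirical conditional frequencies
    mean(A[rain == 1])  # should be close to 2/3
    mean(A[rain == 0])  # should be close to 1/3
    mean(rain[A == 1])  # estimate of P(rain = 1 | A = 1)
    ```

    By Bayes' rule the last value should be near (0.8 * 2/3) / (0.8 * 2/3 + 0.2 * 1/3) = 8/9.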