I want to understand why I am not getting a probability distribution when I use a simulation from a random normal distribution:
df <- mtcars # data
df$sd <- sd(df$mpg) # standard deviation of the sample
f <- function(n1, s1, n2, s2){
  mean(rnorm(10000, n1, s1) < rnorm(10000, n2, s2)) # simulated estimate of P(X < Y)
}
g <- Vectorize(f, c("n1", "s1", "n2", "s2"))
res <- outer(df$mpg, df$sd, df$mpg, df$sd, FUN = g)
dimnames(res) <- list(row.names(df), row.names(df))
res <- data.frame(res)
res <- tibble::rownames_to_column(res, 'p1')
datalong_2 <- tidyr::gather(res, 'p2', 'value', 2:33) # output
I ran this simulation, but for some reason I am not getting an actual probability distribution. My goal is to evaluate the probability that one car has a lower mpg than another car, but the probabilities do not sum to one. I expect them to sum to one, or to slightly less given that a tie might happen.
For example, the probability that Mazda RX4 has a lower mpg than Mazda RX4 Wag is 0.5094, while the probability that Mazda RX4 Wag has a lower mpg than Mazda RX4 is 0.5029; these two probabilities sum to 1.0123. How can I change this code to get an actual probability distribution for one car having a lower mpg than another car?
Unless you absolutely have to run simulations, you can use the pnorm() function to calculate the probabilities precisely.
We assume that X ~ N(u1, s1) and Y ~ N(u2, s2) are independent, where s1 and s2 are variances. We also know that P(X < Y) = P(X - Y < 0), where X - Y ~ N(u1 - u2, s1 + s2). From this, we can calculate the probabilities precisely:
df <- mtcars # data
df$sd <- sd(df$mpg) # standard deviation of the sample
f <- function(n1, n2){
  pnorm(0, mean = n1 - n2, sd = sqrt(2*df$sd^2)) # P(X - Y < 0), both cars sharing the same sd
}
res <- outer(X = df$mpg, Y = df$mpg, FUN = f)
dimnames(res) <- list(row.names(df), row.names(df))
res <- data.frame(res)
res <- tibble::rownames_to_column(res, 'p1')
datalong_2 <- tidyr::gather(res, 'p2', 'value', 2:33) # output
> datalong_2
                 p1        p2      value
1         Mazda RX4 Mazda.RX4 0.50000000
2     Mazda RX4 Wag Mazda.RX4 0.50000000
3        Datsun 710 Mazda.RX4 0.41637203
4    Hornet 4 Drive Mazda.RX4 0.48128464
5 Hornet Sportabout Mazda.RX4 0.60636049
..               ..        ..         ..
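As a rough sanity check of the identity P(X < Y) = P(X - Y < 0), the sketch below compares the closed-form value with a simulated estimate for one pair of cars (Datsun 710 as X, Mazda RX4 as Y). The seed and the variable names (u1, u2, s, p_xy, sim_xy and so on) are illustrative choices, not part of the original code. As an aside, it also shows that two independently simulated estimates need not sum to exactly one, whereas the closed-form values do.

```r
set.seed(42)                       # arbitrary seed so the draws are reproducible
s  <- sd(mtcars$mpg)               # common standard deviation, as in the code above
u1 <- mtcars["Datsun 710", "mpg"]  # mean of X
u2 <- mtcars["Mazda RX4", "mpg"]   # mean of Y

# Closed form: X - Y ~ N(u1 - u2, s^2 + s^2), so P(X < Y) = P(X - Y < 0)
p_xy <- pnorm(0, mean = u1 - u2, sd = sqrt(2 * s^2))
p_yx <- pnorm(0, mean = u2 - u1, sd = sqrt(2 * s^2))

# Monte Carlo estimates, each from its own independent set of draws
n <- 10000
sim_xy <- mean(rnorm(n, u1, s) < rnorm(n, u2, s))
sim_yx <- mean(rnorm(n, u2, s) < rnorm(n, u1, s))

p_xy + p_yx       # the exact values are complements
sim_xy + sim_yx   # close to 1, but subject to sampling noise
```

Here p_xy should agree with the Datsun 710 / Mazda.RX4 row of the output above, and the simulated value should land near it.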
Also, I think your main problem was in the function outer(), which requires 2 inputs, X and Y. It worked for me once I changed it.
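To make the point about outer() concrete, here is a minimal sketch of the two-input form. According to the outer() documentation, only X and Y are crossed; any further arguments are passed through to FUN unchanged. The helper name p_less and the variables mpg and s are made up for this illustration.

```r
mpg <- mtcars$mpg
s   <- sd(mpg)                     # common standard deviation, as in the answer

# outer() crosses X and Y only: FUN receives two equal-length vectors holding
# every (X[i], Y[j]) combination and must return a vector of the same length.
p_less <- function(m1, m2) pnorm(0, mean = m1 - m2, sd = sqrt(2 * s^2))
res <- outer(X = mpg, Y = mpg, FUN = p_less)
dimnames(res) <- list(rownames(mtcars), rownames(mtcars))

dim(res)                           # one row and one column per car

# res[i, j] is P(car i < car j); the mirrored entry res[j, i] is its complement
all.equal(unname(res + t(res)), matrix(1, nrow(res), ncol(res)))
```

Because pnorm() is already vectorized, no Vectorize() wrapper is needed in this form.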
Edits 2 & 3:
df1 <- mtcars; df1$rownames = rownames(df1)
df2 <- mtcars; df2$rownames = rownames(df2)
df2$mpg = df2$mpg + rnorm(nrow(df2),0,3)
data = rbind(df1, df2)
df = plyr::ddply(data, ~rownames, plyr::summarise, mean = mean(mpg), sd = sd(mpg)) # ddply() and summarise() come from the plyr package
df = rbind(df, c("car1",-1.02, 2.66))
df = rbind(df, c("car2",0.13, 0.06))
df$mean <- as.numeric(df$mean)
df$sd <- as.numeric(df$sd)
f <- function(x, y){
  # x and y are row indices into df; each car now has its own mean and sd
  n1 = df$mean[x]; n2 = df$mean[y]; sd1 = df$sd[x]; sd2 = df$sd[y]
  pnorm(0, mean = n1 - n2, sd = sqrt(sd1^2 + sd2^2))
}
res <- outer(X = 1:nrow(df), Y = 1:nrow(df), f)
dimnames(res) <- list(df$rownames, df$rownames)
res <- data.frame(res)
res <- tibble::rownames_to_column(res, 'p1')
datalong_2 <- tidyr::gather(res, 'p2', 'value', -1) # output
subset(datalong_2, p1 %in% c("car1","car2") & p2 %in% c("car1","car2"))
> subset(datalong_2, p1 %in% c("car1","car2") & p2 %in% c("car1","car2"))
       p1   p2     value
1121 car1 car1 0.5000000
1122 car2 car1 0.3327904
1155 car1 car2 0.6672096
1156 car2 car2 0.5000000
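The two off-diagonal values can be reproduced by hand from the means and standard deviations given for car1 and car2. This is a small sketch of that calculation; the names m, s, p12 and p21 are introduced here only for illustration.

```r
m <- c(car1 = -1.02, car2 = 0.13)   # means of the two appended rows
s <- c(car1 = 2.66,  car2 = 0.06)   # standard deviations of the same rows

# The difference of two independent normals has variance sd1^2 + sd2^2
sd_diff <- sqrt(s[["car1"]]^2 + s[["car2"]]^2)

p12 <- pnorm(0, mean = m[["car1"]] - m[["car2"]], sd = sd_diff)  # P(car1 < car2)
p21 <- pnorm(0, mean = m[["car2"]] - m[["car1"]], sd = sd_diff)  # P(car2 < car1)

round(c(p12 = p12, p21 = p21, total = p12 + p21), 7)
```

The two probabilities are exact complements, so they sum to one, which is the property the simulated values in the question lacked.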