Search code examples
rprobability

Probability of random variable from group A being larger than group B


I have two datasets that contain people's height (among other details) stored in data frames:

dataset1$height
dataset2$height

I need to know the probability of a randomly selected height from dataset1 will be larger than a randomly selected height from dataset2.

I'm aware I'm trying to calculate P(X-Y)>0 here, but I'm not sure how to put this in R, and how to get an answer for P (i.e. what is the probability that dataset1$height - dataset2$height > 0).

I've calculated the mean, variance and standard deviation for the height in each dataset, I'm just not sure how to fit these calculations into a formula that results in the probability.

For reproducibility, the following samples could represent the what I have:

dataset1 = rnorm(100, mean = 11, sd = 2)
dataset2 = rnorm(100, mean = 10, sd = 2)
mean1 = mean(dataset1)
mean2 = mean(dataset2)
var1 = var(dataset1)
var2 = var(dataset2)
sdev1 = sd(dataset1)
sdev2 = sd(dataset2)
Probability = mean(dataset2)/mean(dataset1)
Probability

Solution

  • From the way you phrased your question, dataset1 and dataset2 are the considered the populations from which your RV is drawn. In that case, Montgomery is correct that you could average all pairwise comparisons between dataset1 and dataset2:

    mean(outer(dataset1, dataset2, ">"))
    

    Although, a more efficient method for larger datasets would be to use the ordering of the concatenated vectors:

    d12 <- order(c(dataset1, dataset2))
    sum(cumsum(d12 > length(dataset1))[d12 <= length(dataset1)])/length(dataset1)/length(dataset2)
    

    If, on the other hand, dataset1 and dataset2 represent samples from parent distributions, and you want to know the probability that a randomly selected height from X ~ distribution1 will be larger than a randomly selected height from Y ~ distribution2, it would depend on your knowledge/assumptions about the underlying distributions. For example, if they were both independent normal distributions with known means mu1, mu2 and standard deviations sigma1, sigma2, then the distribution of X - Y would be normal with mean mu1 - mu2 and standard deviation sqrt(sigma1^2 + sigma2^2). P(X > Y) would be:

    pnorm(0, mean = mu1 - mu2, sd = sqrt(sigma1^2 + sigma2^2), lower.tail = FALSE)
    

    Using larger vectors with your example parameters, the two approaches give similar answers:

    > set.seed(94)
    > mu1 <- 11; mu2 <- 10; sigma1 <- 2; sigma2 <- 2
    > dataset1 = rnorm(1e6, mean = mu1, sd = sigma1)
    > dataset2 = rnorm(1e6, mean = mu2, sd = sigma2)
    > d12 <- order(c(dataset1, dataset2)); sum(cumsum(d12 > length(dataset1))[d12 <= length(dataset1)])/length(dataset1)/length(dataset2)
    [1] 0.6384645
    > pnorm(0, mean = mu1 - mu2, sd = sqrt(sigma1^2 + sigma2^2), lower.tail = FALSE)
    [1] 0.6381632