I have two datasets that contain people's height (among other details) stored in data frames:
dataset1$height
dataset2$height
I need to know the probability of a randomly selected height from dataset1 will be larger than a randomly selected height from dataset2.
I'm aware I'm trying to calculate P(X-Y)>0
here, but I'm not sure how to put this in R, and how to get an answer for P (i.e. what is the probability that dataset1$height - dataset2$height > 0
).
I've calculated the mean, variance and standard deviation for the height in each dataset, I'm just not sure how to fit these calculations into a formula that results in the probability.
For reproducibility, the following samples could represent the what I have:
dataset1 = rnorm(100, mean = 11, sd = 2)
dataset2 = rnorm(100, mean = 10, sd = 2)
mean1 = mean(dataset1)
mean2 = mean(dataset2)
var1 = var(dataset1)
var2 = var(dataset2)
sdev1 = sd(dataset1)
sdev2 = sd(dataset2)
Probability = mean(dataset2)/mean(dataset1)
Probability
From the way you phrased your question, dataset1
and dataset2
are the considered the populations from which your RV is drawn. In that case, Montgomery is correct that you could average all pairwise comparisons between dataset1
and dataset2
:
mean(outer(dataset1, dataset2, ">"))
Although, a more efficient method for larger datasets would be to use the ordering of the concatenated vectors:
d12 <- order(c(dataset1, dataset2))
sum(cumsum(d12 > length(dataset1))[d12 <= length(dataset1)])/length(dataset1)/length(dataset2)
If, on the other hand, dataset1
and dataset2
represent samples from parent distributions, and you want to know the probability that a randomly selected height from X ~ distribution1 will be larger than a randomly selected height from Y ~ distribution2, it would depend on your knowledge/assumptions about the underlying distributions. For example, if they were both independent normal distributions with known means mu1
, mu2
and standard deviations sigma1
, sigma2
, then the distribution of X - Y would be normal with mean mu1 - mu2
and standard deviation sqrt(sigma1^2 + sigma2^2)
. P(X > Y) would be:
pnorm(0, mean = mu1 - mu2, sd = sqrt(sigma1^2 + sigma2^2), lower.tail = FALSE)
Using larger vectors with your example parameters, the two approaches give similar answers:
> set.seed(94)
> mu1 <- 11; mu2 <- 10; sigma1 <- 2; sigma2 <- 2
> dataset1 = rnorm(1e6, mean = mu1, sd = sigma1)
> dataset2 = rnorm(1e6, mean = mu2, sd = sigma2)
> d12 <- order(c(dataset1, dataset2)); sum(cumsum(d12 > length(dataset1))[d12 <= length(dataset1)])/length(dataset1)/length(dataset2)
[1] 0.6384645
> pnorm(0, mean = mu1 - mu2, sd = sqrt(sigma1^2 + sigma2^2), lower.tail = FALSE)
[1] 0.6381632