I would like to create a joint probability distribution by combining two dataframes. Each dataframe contains data drawn from the same population, but the data is not matched. For the sake of providing workable code, imagine that the data is as follows:
v1 <- data.frame(rnorm(100, 0, 3))
v2 <- data.frame(rnorm(30, 10, 20))
In reality I have survey data and simulation data that does not follow a pre-set probability distribution. I am looking for a solution that can combine two vectors of different lengths to create a joint probability distribution.
Dataset v1 represents the distribution of financial returns that can be earned by installing solar panels.
Dataset v2 represents the financial return threshold for households interested in installing solar. A household will only install solar if they live in a home that would meet the threshold they have set in terms of financial return.
Given these two datasets, I'd like to use the joint probability distribution to estimate the likely proportion of households that will adopt and install solar panels.
I've considered running a Monte Carlo exercise where I would randomly draw from v1 and match it with a random draw from v2. I would repeat the process 1000 times and see how many homes would have achieved a return greater than their threshold.
library(tidyverse)
set.seed(1234)
monte <- NULL
for (i in 1:1000) {
  # Draw one return and one threshold independently of each other
  draw1 <- sample_n(v1, 1)
  draw2 <- sample_n(v2, 1)
  # Append the unmatched pair as a new row
  monte <- rbind(monte, data.frame(draw1, draw2))
}
colnames(monte) <- c("return","threshold")
# Share of simulated pairs in which the return exceeds the threshold
adoption <- monte %>%
  summarize(count = sum(return > threshold),
            total = n()) %>%
  mutate(adoption = count / total)
This could work, but I am wondering if there is an alternate way to combine these vectors into a joint probability distribution using R. I would like to be able to generate summary statistics (e.g. proportion of households that would achieve a net return greater than their required threshold), and also visualize the joint distribution in 2-dimensional space.
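Strictly speaking, the question does not make sense: if the data is not matched, you cannot recover or visualize the true joint distribution, because unmatched samples carry no information about how the two variables depend on each other.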
The Monte Carlo exercise you've put together is something akin to a permutation + bootstrap procedure, where you are trying to test against a null hypothesis that there is no relationship between the two variables.
It is not possible to directly calculate a "joint distribution"; the best you can do is simulate draws under the null hypothesis and conduct subsequent inference on them, e.g. testing whether the proportion is larger than, say, 0.5. That is, unless you are willing to go Bayesian.
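As a rough sketch of what that inference could look like: if you are willing to assume the two variables are independent (an assumption that unmatched data cannot verify), every (return, threshold) pair is equally likely, so the proportion your Monte Carlo loop approximates can be computed directly over all pairs. The vectors below reuse the toy data from the question, and `boot_adoption` is a hypothetical helper name introduced only for illustration.

```r
# Toy data from the question (stand-ins for the real survey/simulation samples)
set.seed(1234)
returns    <- rnorm(100, 0, 3)    # v1: achievable financial returns
thresholds <- rnorm(30, 10, 20)   # v2: required return thresholds

# ASSUMPTION: returns and thresholds are independent.
# Under that assumption the adoption proportion is the share of all
# 100 x 30 pairs in which the return exceeds the threshold.
exact_adoption <- mean(outer(returns, thresholds, ">"))

# Hypothetical helper: resample each vector with replacement and recompute
# the proportion, to get a feel for the sampling uncertainty.
boot_adoption <- function(r, t) {
  mean(outer(sample(r, replace = TRUE), sample(t, replace = TRUE), ">"))
}
boot <- replicate(2000, boot_adoption(returns, thresholds))

exact_adoption
quantile(boot, c(0.025, 0.975))   # percentile bootstrap interval
```

The looped simulation converges to `exact_adoption` as the number of draws grows, so the all-pairs calculation removes the simulation noise while keeping the same independence assumption.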
If you wish to visualize the null distribution (or any joint distribution in general), the usual scatter plot or contour plot will work.
# Contour plot of the simulated (return, threshold) pairs
monte |>
  ggplot() +
  geom_density_2d(aes(x = return, y = threshold))

# Scatter plot of the same pairs
monte |>
  ggplot() +
  geom_point(aes(x = return, y = threshold))
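If it helps to see where the adoption proportion comes from, one possible variation is to plot every combination of the two samples instead of random draws and to mark the region where the return exceeds the threshold. This is only a sketch under the same independence assumption; it reuses the toy data from the question, and `all_pairs` and `adopts` are names made up for the example.

```r
library(ggplot2)

# Toy data from the question (stand-ins for the real samples)
set.seed(1234)
returns    <- rnorm(100, 0, 3)
thresholds <- rnorm(30, 10, 20)

# ASSUMPTION: independence, so the implied joint distribution puts equal
# weight on each of the 100 x 30 (return, threshold) combinations.
all_pairs <- expand.grid(return = returns, threshold = thresholds)
all_pairs$adopts <- all_pairs$return > all_pairs$threshold

# Points below the dashed line (threshold < return) are adopters
p <- ggplot(all_pairs, aes(x = return, y = threshold, colour = adopts)) +
  geom_point(alpha = 0.4) +
  geom_abline(slope = 1, intercept = 0, linetype = "dashed") +
  labs(colour = "return > threshold")
p
```

The share of points flagged in `adopts` is the adoption proportion under independence, so the plot and the summary statistic come from the same table.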