Search code examples
rsimulationcovariance

Sampling from a multivariate distribution including gender in R


I'm trying to simulate a wider population from a small one in R as follows:

idata <- subset(data, select=c(WT, AGE, HT, BFP, SEX) )
M= cor(idata)
mu <- sapply(idata, mean)
sd <- sapply(idata, stdev)
sigma=cor2cov(M, sd)
simulation <- as.data.frame(mvrnorm(1000, mu, sigma))

But the problems is, for SEX, the code will consider a continuous distribution, while it has to be binary, and effects of sex has to be either fully considered (SEX==1), or not at all (SEX==0). I'd appreciate any help with this regard. Thanks


Solution

  • What you should do is consider that your data consists of two sub-populations, and then draw data from them, based on their proportions.

    So, first estimate the proportions, pi_m and pi_f (= 1 - pi_m), which are the proportion of SEX == 0 and SEX == 1. This should be something like pi_m = sum(idata$SEX == 1)/ nrow(idata)

    Then estimate parameters for the two populations, mu_f, mu_m, sigma_f and sigma_m, which are mean and covariance parameters for the two SEX populations (now without the SEX variable).

    The first draw a random number r <- runif(1), if this is less than equal to pi_m then generate a sample from N(mu_m, sigma_s) else from N(mu_f, sigma_f).

    You can do this step 1000 times to get 1000 samples from your distribution.

    Of course, you can vector this, by first generating 1000 samples from runif. For example

    n_m <- sum(runif(1000) <= pi_m)
    n_f <- 1000 - n_m
    
    X_m <- rmvnorm(n_m, mu_m, sigma_m)
    X_f <- rmvnorm(n_f, mu_f, sigma_f)
    
    X <- rbind(X_m, X_f)