I'm trying to simulate a wider population from a small one in R as follows:
idata <- subset(data, select=c(WT, AGE, HT, BFP, SEX) )
M= cor(idata)
mu <- sapply(idata, mean)
sd <- sapply(idata, stdev)
sigma=cor2cov(M, sd)
simulation <- as.data.frame(mvrnorm(1000, mu, sigma))
But the problems is, for SEX, the code will consider a continuous distribution, while it has to be binary, and effects of sex has to be either fully considered (SEX==1), or not at all (SEX==0). I'd appreciate any help with this regard. Thanks
What you should do is consider that your data consists of two sub-populations, and then draw data from them, based on their proportions.
So, first estimate the proportions, pi_m
and pi_f (= 1 - pi_m)
, which are the proportion of SEX == 0 and SEX == 1. This should be something like
pi_m = sum(idata$SEX == 1)/ nrow(idata)
Then estimate parameters for the two populations, mu_f
, mu_m
, sigma_f
and sigma_m
, which are mean and covariance parameters for the two SEX populations (now without the SEX variable).
The first draw a random number r <- runif(1)
, if this is less than equal to pi_m then generate a sample from N(mu_m, sigma_s)
else from N(mu_f, sigma_f)
.
You can do this step 1000 times to get 1000 samples from your distribution.
Of course, you can vector this, by first generating 1000 samples from runif
. For example
n_m <- sum(runif(1000) <= pi_m)
n_f <- 1000 - n_m
X_m <- rmvnorm(n_m, mu_m, sigma_m)
X_f <- rmvnorm(n_f, mu_f, sigma_f)
X <- rbind(X_m, X_f)