Search code examples
rif-statementconditional-statements

Generating two series with a certain correlation and a specific condition in R


I want to generate two data series of size 100 in R, one of which is going to be remission time, tr, from Exp(mean=1) distribution and the other one is going to be survival time, t, from Exp(mean=2.5) distribution. I want them to be negatively correlated (say, the correlation is -0.5). But at the same time I want that R avoids the values of t[i] that are less than tr[i] for data point i, because survival times should be greater than remission times. I have been able to produce some correlation between the two variables (although the correlation is not exactly reproduced) using the following codes:

rho <- -0.5
mu <- rep(0,2)
Sigma <- matrix(rho, nrow=2, ncol=2) + diag(2)*(1 - rho)

library(MASS)

rawvars <- mvrnorm(100, mu=mu, Sigma=Sigma)

pvars <- pnorm(rawvars)
    
tr<-rep(0,100)
for(i in 1:100){
tr[i] <- qexp(pvars[,1][i], 1/1)
    }
    
t<-rep(0,100)
for(i in 1:100){
repeat { 
    t[i] <- qexp(pvars[,2][i], 1/2) 
    if (t[i]>tr[i]) break
}
}
    

cor(tr,t)

sum(tr>t) # shows number of invalid cases

But how should I efficiently induce the condition so that R only generates values of t that are greater than corresponding tr?

Moreover, is there a better way (faster way) to do the whole thing in R?


Solution

  • The issue here is that qexp is the quantile function and will return the same value for the same probability pvars[,2][i]. As a result, your code can easily go into an infinite loop when any one of the pvars[i,] is such that t[i]<=tr[i]. To avoid that, you must regenerate your rawvars for each t[i], tr[i] pair that fails your condition. In addition, looping over pvars is not necessary since qexp and operator > are all vectorized. The following code does what you want:

    rho <- -0.5
    mu <- rep(0,2)
    Sigma <- matrix(rho, nrow=2, ncol=2) + diag(2)*(1 - rho)
    
    library(MASS)
    set.seed(1)  ## so that results are repeatable
    
    compute.tr.t <- function(n, paccept) {
      n <- round(n / paccept)
      rawvars <- mvrnorm(n, mu=mu, Sigma=Sigma)
      pvars <- pnorm(rawvars)
      tr <- qexp(pvars[,1], 1/1)
      t <- qexp(pvars[,2], 1/2)
      keep <- which(t > tr)
      return(data.frame(t=t[keep],tr=tr[keep]))
    }
    
    n <- 10000  ## generating 10000 instead of 100, this can now be large
    paccept <- 1
    res <- data.frame()
    while (n > 0) {
      new.res <- compute.tr.t(n, paccept)
      res <- rbind(res, new.res)
      paccept <- nrow(new.res) / n
      n <- n - nrow(res)
    }
    

    Notes:

    1. The function compute.tr.t borrows a technique from rejection sampling here. Its input arguments are the requested number of samples that we want and the expected probability of acceptance. With this:

      • It generates n = n / paccept exponential variates for both tr and t as you do to account for the probability of acceptance
      • It only keeps those satisfying the condition t > tr.

      What compute.tr.t returns may be less than the requested n samples. We can then use this information to compute how many more samples we need and what the updated expected probability of acceptance is.

    2. We generate the samples satisfying our condition in a while loop. In this loop:

      • We call compute.tr.t with a requested number of samples to generate and the expected acceptance rate. Initially, these will be set to how many total samples we want and 1, respectively.
      • The result of compute.tr.t are then appended to the result data frame res.
      • Updating the probability of accept is simply the ratio of how many samples were returned over how many were requested.
      • Updating the requested number of samples is simply how many more we need from the total number we want.
      • We stop when the next requested number of samples is less than or equal to 0 (i.e., we have enough samples).

      The resulting data frame may contain more than the total number of samples we want.

    Running this code, we get:

    print(cor(res$tr,res$t))
    [1] -0.09128498
    print(sum(res$tr>res$t)) # shows number of invalid cases
    ##[1] 0
    

    We note that the anti correlation is significantly weaker than expected. This is due to your condition. If we remove this condition by modifying compute.tr.t as:

    compute.tr.t <- function(n, paccept) {
      n <- round(n / paccept)
      rawvars <- mvrnorm(n, mu=mu, Sigma=Sigma)
      pvars <- pnorm(rawvars)
      tr <- qexp(pvars[,1], 1/1)
      t <- qexp(pvars[,2], 1/2)
      return(data.frame(t=t,tr=tr))
    }
    

    Then we get:

    print(cor(res$tr,res$t))
    ##[1] -0.3814602
    print(sum(res$tr>res$t)) # shows number of invalid cases
    ##[1] 3676
    

    The correlation is now much more reasonable, but the number of invalid cases is significant.