r, simulation, distribution, missing-data

Simulating Data on (Y1, Y2) where Y2 has missing values


Consider a two-variable (Y1, Y2) problem, with each variable defined as follows:

  • Y1 = 1 + Z1, and Y1 is fully observed
  • Y2 = 5 + 2*(Z1) + Z2, and Y2 is missing if 2*(Y1 − 1) + Z3 < 0
  • Z1, Z2, and Z3 follow independent standard normal distributions.

How would we go about simulating a (complete) dataset of size 500 on (Y1, Y2)? This is what I have so far:

    n <- 500
    y <- rnorm(n)

How would we simulate the corresponding observed dataset (by imposing missingness on Y2)? I'm not sure where to go with this question.

    n <- 500
    z1 <- rnorm(n)
    z2 <- rnorm(n)
    z3 <- rnorm(n)

    y1 <- 1 + z1
    y2 <- 5 + 2*z1 + z2

Display the marginal distribution of Y2 for the complete (as originally simulated) and observed (after imposing missingness) data.


Solution

  • In addition to the great explanation by @jay.sf, another way to display the distributions is to apply the missing-data mechanism to a copy of y2 (y2_missing) and then compare y2 with y2_missing:

    library(ggplot2)
    library(dplyr)
    library(tidyr)
    set.seed(123)
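    # Sample size
    n <- 500
    # Independent standard normal draws
    z1 <- rnorm(n)
    z2 <- rnorm(n)
    z3 <- rnorm(n)
    # Complete data: Y1 and Y2 as defined in the question
    y1 <- 1 + z1
    y2 <- 5 + 2 * z1 + z2
    # Observed version of Y2: start from a copy of the complete values
    y2_missing <- y2
    # Set Y2 to NA wherever 2*(Y1 - 1) + Z3 < 0
    index <- which(2 * (y1 - 1) + z3 < 0)
    y2_missing[index] <- NA
    # Data frame holding Y1, the complete Y2, and the observed Y2
    df <- data.frame(y1, y2, y2_missing)
    # Overlay the densities of the complete and observed Y2
    df %>% select(-y1) %>%
      pivot_longer(everything()) %>%
      ggplot(aes(x = value, fill = name)) +
      geom_density(alpha = 0.5, na.rm = TRUE) +
      ggtitle('Distribution for y2 and y2_missing') +
      labs(fill = 'Variable') +
      theme_bw()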
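
A numeric summary can complement the density plot. The following is only a sketch in base R: the helper name `simulate_y` and the layout of `summary_tab` are my own and not part of the answer above. It repeats the same simulation steps, then reports the share of missing Y2 values and the mean and standard deviation of the complete and observed Y2.

```r
# Hypothetical helper (not from the original answer): simulate Y1, the
# complete Y2, and the observed Y2 (with NAs) in one data frame.
simulate_y <- function(n = 500, seed = 123) {
  set.seed(seed)
  z1 <- rnorm(n)
  z2 <- rnorm(n)
  z3 <- rnorm(n)
  y1 <- 1 + z1
  y2 <- 5 + 2 * z1 + z2
  # Y2 is missing when 2*(Y1 - 1) + Z3 < 0
  y2_missing <- ifelse(2 * (y1 - 1) + z3 < 0, NA_real_, y2)
  data.frame(y1, y2, y2_missing)
}

sim <- simulate_y()

# Share of Y2 values set to NA; about one half in expectation,
# because 2*Z1 + Z3 is symmetric around zero
prop_missing <- mean(is.na(sim$y2_missing))

# Mean and SD of the complete and the observed Y2
summary_tab <- data.frame(
  variable = c("y2", "y2_missing"),
  n_obs    = c(sum(!is.na(sim$y2)), sum(!is.na(sim$y2_missing))),
  mean     = c(mean(sim$y2), mean(sim$y2_missing, na.rm = TRUE)),
  sd       = c(sd(sim$y2), sd(sim$y2_missing, na.rm = TRUE))
)

print(prop_missing)
print(summary_tab)
```

Because the missingness rule depends on Y1, and Y1 is positively correlated with Y2 through Z1, the observed values should have a higher mean than the complete ones. This matches the rightward shift of the y2_missing density in the plot.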
    
