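Consider a two-variable (Y1, Y2) problem, with each variable defined as follows: Y1 = 1 + Z1 and Y2 = 5 + 2*Z1 + Z2, where Z1, Z2 and Z3 are independent standard normal variables, and Y2 is missing whenever 2*(Y1 - 1) + Z3 < 0.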
How would we go about simulating a (complete) dataset of size 500 on (Y1, Y2)? This is what I have so far:
n <- 500
y <- rnorm(n)
How would we simulate the corresponding observed dataset (by imposing missingness on Y2)? I'm not sure where to go with this question.
n <- 500
z1 <- rnorm(n)
z2 <- rnorm(n)
z3 <- rnorm(n)
y1 <- 1 + z1
y2 <- 5 + 2*z1 + z2
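To impose the missingness, one option is to build a logical indicator and blank out the flagged values of Y2. The following is only a minimal sketch: it assumes the rule "Y2 is missing when 2*(Y1 - 1) + Z3 < 0" (the condition used in the answer's code further down), and the seed and the names `r`, `y2_obs`, `complete` and `observed` are chosen here purely for illustration.

```r
set.seed(123)  # arbitrary seed, only so the sketch is reproducible

n  <- 500
z1 <- rnorm(n)
z2 <- rnorm(n)
z3 <- rnorm(n)

y1 <- 1 + z1
y2 <- 5 + 2 * z1 + z2

# Assumed missingness rule: Y2 is missing when 2*(Y1 - 1) + Z3 < 0
r <- 2 * (y1 - 1) + z3 < 0      # TRUE marks a value to be set missing
y2_obs <- ifelse(r, NA, y2)     # observed version of Y2

complete <- data.frame(y1, y2)           # data as originally simulated
observed <- data.frame(y1, y2 = y2_obs)  # data after imposing missingness

mean(r)  # proportion of Y2 values that were set to NA
```

Keeping `complete` and `observed` as separate data frames makes the later comparison of the two marginal distributions of Y2 straightforward.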
Display the marginal distribution of Y2 for the complete (as originally simulated) and observed (after imposing missingness) data.
Another way to display the distributions, in addition to the great explanation by @jay.sf, is to build the missing-data mechanism into a new variable and compare y2 with y2_missing:
library(ggplot2)
library(dplyr)
library(tidyr)
set.seed(123)
#Data
n <- 500
#Random vars
z1 <- rnorm(n)
z2 <- rnorm(n)
z3 <- rnorm(n)
#Design Y1 and Y2
y1 <- 1 + z1
y2 <- 5 + 2*z1 + z2
#For missing
y2_missing <- y2
#Set missing
index <- which(((2*(y1-1))+z3)<0)
y2_missing[index]<-NA
#Complete dataset
df <- data.frame(y1,y2,y2_missing)
#Plot distributions
df %>%
  select(-y1) %>%
  pivot_longer(everything()) %>%
  ggplot(aes(x = value, fill = name)) +
  geom_density(alpha = 0.5) +
  ggtitle('Distribution for y2 and y2_missing') +
  labs(fill = 'Variable') +
  theme_bw()
Output: (overlaid density plot of y2 and y2_missing; image not reproduced here)
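As a numeric complement to the density plot, the two versions of Y2 can also be summarised side by side. This is a rough sketch rather than part of the original answer: it regenerates the data with the same seed and missingness condition as the code above, and the name `summary_tab` is made up for illustration.

```r
set.seed(123)

n  <- 500
z1 <- rnorm(n)
z2 <- rnorm(n)
z3 <- rnorm(n)

y1 <- 1 + z1
y2 <- 5 + 2 * z1 + z2

# Same condition as above: set Y2 to NA when 2*(Y1 - 1) + Z3 < 0
y2_missing <- replace(y2, 2 * (y1 - 1) + z3 < 0, NA)

# Side-by-side summary of complete vs observed Y2
summary_tab <- data.frame(
  variable = c("y2", "y2_missing"),
  n_obs    = c(sum(!is.na(y2)), sum(!is.na(y2_missing))),
  mean     = c(mean(y2), mean(y2_missing, na.rm = TRUE)),
  sd       = c(sd(y2), sd(y2_missing, na.rm = TRUE))
)
print(summary_tab)
```

Because the missingness condition depends on Y1, and Y1 is positively correlated with Y2, the observed values should tend to sit above the complete-data mean, which is the shift visible in the density plot.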