I am trying to create a sample data set (most of the code is from this question). It is almost how I want it to be, but there are two things I still want to do and cannot figure out.
I would like to create a higher correlation between y and year, without rearranging the whole data set (so by only changing the values of y).
If possible, I would like to be able to set the true correlation between event and y, again by changing only y. (So far I have just changed the set.seed() value manually until I got a significant relation.)
Could someone explain how to do this?
set.seed(2)
a <- 2 # structural parameter of interest
b <- 1 # strength of instrument
rho <- 0.5 # degree of endogeneity
N <- 1000
z <- rnorm(N)
res1 <- rnorm(N)
res2 <- res1*rho + sqrt(1-rho*rho)*rnorm(N)
x <- z*b + res1
ys <- x*a + res2
d <- (ys>0) #dummy variable
y <- round(10-(d*ys))
random_variable <- rnorm(100, mean = 0, sd = 1)
library(data.table)
DT_1 <- data.frame(y,x,z, random_variable)
DT_2 <- structure(list(ID = c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12,
13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28,
29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44,
45, 46, 47, 48, 49, 50), year = c(1995, 1995, 1995, 1995, 1995,
1995, 1995, 1995, 1995, 1995, 2000, 2000, 2000, 2000, 2000, 2000,
2000, 2000, 2000, 2000, 2005, 2005, 2005, 2005, 2005, 2005, 2005,
2005, 2005, 2005, 2010, 2010, 2010, 2010, 2010, 2010, 2010, 2010,
2010, 2010, 2015, 2015, 2015, 2015, 2015, 2015, 2015, 2015, 2015,
2015), Group = c("A", "A", "A", "A", "B", "B", "B", "B", "C",
"C", "A", "A", "A", "A", "B", "B", "B", "B", "C", "C", "A", "A",
"A", "A", "B", "B", "B", "B", "C", "C", "A", "A", "A", "A", "B",
"B", "B", "B", "C", "C", "A", "A", "A", "A", "B", "B", "B", "B",
"C", "C"), event = c(1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1), win_or_lose = c(-1,
-1, -1, -1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, -1, -1, -1, -1, 1, 1, 1, 1, 0, 0,
-1, -1, -1, -1, 1, 1, 1, 1, 0, 0)), row.names = c(NA, -50L), class = c("tbl_df",
"tbl", "data.frame"))
DT_1 <- setDT(DT_1)
DT_2 <- setDT(DT_2)
DT_2 <- rbind(DT_2 , DT_2 [rep(1:50, 19), ])
sandbox <- cbind(DT_1, DT_2)
This approach uses the following idea:
- Add a term to y that depends on year. This boosts the correlation a lot. The strength of the dependence is set by beta, so you can tweak beta to increase the influence of year. Note that I worked with year-mean(year) so that the overall scale of y is not shifted too much. If you don't care about y being shifted, just drop the mean part.
- Add some noise to y. You can tweak the sd parameter to increase the noise and thus decrease the correlation.
I save the result in y2 so that you can play around more easily. When you're satisfied with the parameters beta and sd, you can just overwrite y.
noise = rnorm(n = nrow(sandbox), mean = 0, sd = 0.01)
beta = 0.1
sandbox$y2 = sandbox$y + beta * (sandbox$year - mean(sandbox$year)) + noise
cor(sandbox$y2, sandbox$year)
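To see why beta and sd pull in opposite directions, it may help to write the correlation out. This is only a sketch in my own notation: t stands for year, c = cov(y, t), v = var(t), s_y^2 = var(y), and sigma is the sd of the noise, which is assumed to be independent of both y and year.

```latex
\operatorname{cor}(y_2, t)
  = \frac{c + \beta v}{\sqrt{v}\,\sqrt{s_y^2 + 2\beta c + \beta^2 v + \sigma^2}}
\qquad\text{and, when } c \approx 0,\qquad
\operatorname{cor}(y_2, t) \approx \frac{\beta\sqrt{v}}{\sqrt{s_y^2 + \beta^2 v + \sigma^2}}
```

A larger |beta| pushes the value toward +1 or -1, while a larger sigma inflates the denominator and pulls it toward 0. Subtracting mean(year) only changes the level of y2, not the correlation.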
Good luck and please provide feedback or clarification if this is not the desired behavior.
EDIT:
Here you can see the behavior of different beta and sigma values:
betas = seq(-.50, .50, by=.10)
sigmas = seq(0.0, 5.0, by=1.0)
# one row per beta, one column per sigma
M = matrix(data=NA, nrow=length(betas), ncol=length(sigmas))
# loop indices are i and j (not b and s) so that b from the question is not overwritten
for (i in seq_along(betas)){
  for (j in seq_along(sigmas)){
    noise = rnorm(n = nrow(sandbox), mean = 0, sd = sigmas[j])
    sandbox$y2 = sandbox$y + betas[i] * (sandbox$year - mean(sandbox$year)) + noise
    M[i,j] = round(cor(sandbox$y2, sandbox$year), 2)
  }
}
rownames(M) = betas
colnames(M) = sigmas
M
This results in the following matrix output. Rows are beta, columns are sigma, and each cell is the correlation between y2 and year:
         0     1     2     3     4     5
-0.5 -0.86 -0.84 -0.77 -0.66 -0.62 -0.55
-0.4 -0.81 -0.78 -0.70 -0.61 -0.53 -0.47
-0.3 -0.71 -0.68 -0.61 -0.51 -0.45 -0.42
-0.2 -0.56 -0.51 -0.46 -0.32 -0.29 -0.25
-0.1 -0.32 -0.29 -0.25 -0.22 -0.12 -0.08
0     0.01 -0.01  0.00 -0.01  0.01 -0.01
0.1   0.33  0.31  0.24  0.21  0.19  0.12
0.2   0.57  0.52  0.45  0.38  0.33  0.27
0.3   0.72  0.66  0.59  0.48  0.44  0.33
0.4   0.81  0.78  0.71  0.62  0.54  0.48
0.5   0.86  0.84  0.78  0.69  0.61  0.53
EDIT 2: Of course, you can simply use a negative beta to achieve negative correlations. You might also just fix sigma and only adjust beta.
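If you would rather aim at a specific correlation than scan a grid, one option is to fix sigma and solve for beta. The sketch below is not part of the approach above as originally written. It assumes y and year are roughly uncorrelated to begin with, in which case cor(y2, year) is approximately beta * sd(year) / sqrt(var(y) + beta^2 * var(year) + sigma^2), and that expression can be inverted for beta. The helper name beta_for_target and the stand-in data are hypothetical. In practice you would pass sandbox$y and sandbox$year.

```r
# Hypothetical helper: choose beta so that
# cor(y + beta * (year - mean(year)) + noise, year) is roughly `target`.
# Assumes y and year start out (nearly) uncorrelated and that the noise
# is independent of both.
beta_for_target <- function(y, year, target, sigma = 0) {
  stopifnot(abs(target) < 1)
  target / sqrt(1 - target^2) * sqrt(var(y) + sigma^2) / sd(year)
}

# Stand-in data so the example runs on its own (not the real sandbox).
set.seed(1)
year <- rep(c(1995, 2000, 2005, 2010, 2015), each = 200)
y <- round(rnorm(length(year), mean = 9, sd = 2))

sigma <- 1
beta <- beta_for_target(y, year, target = 0.6, sigma = sigma)
noise <- rnorm(length(year), mean = 0, sd = sigma)
y2 <- y + beta * (year - mean(year)) + noise
cor(y2, year)  # should land close to the 0.6 target, up to sampling noise
```

The result is only approximate, because the sample correlation between y and year is never exactly zero. If you need a closer match, rerun with a slightly adjusted beta.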