r dataframe time-series simulation panel

How to make a normally distributed variable depend on entries and time in R?


I'm trying to generate a cross-sectional time-series (panel) dataset so I can compare different models. The dataset has an ID variable and a time variable. I'm trying to add a normally distributed variable that depends on both identifiers. In other words, how do I create a variable that recognizes both ID and time in R? If my question is unclear, feel free to ask. Thanks in advance.

# H is a placeholder: N(...) marks the normal distribution each row should be drawn from
df2 <- read.table(
text =
"Year,ID,H
1,1,N(2.3)
2,1,N(2.3)
3,1,N(2.3)
1,2,N(0.1)
2,2,N(0.1)
3,2,N(0.1)
", sep = ",", header = TRUE)

Solution

  • Assuming that the data in the dataframe df looks like

    ID Time
    1 1
    1 2
    1 3
    1 4
    2 1
    2 2
    2 3
    2 4
    3 1
    3 2
    3 3
    3 4

    you can generate a variable y that depends on both ID and time as the sum of two independent normal draws, one whose parameters are set by ID and one whose parameters are set by time. The sum of independent normal variables is itself normally distributed:

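    set.seed(42)  # make the random draws reproducible

    # 3 IDs, each observed at 4 time points (matches the table above)
    df = data.frame(
      ID   = rep(1:3,   each=4),
      time = rep(1:4,   times=3)
    )

    # rnorm() is vectorised over mean and sd, so each row gets its own parameters
    df$y = rnorm(nrow(df), mean=df$ID,   sd=1+0.1*df$ID) +
           rnorm(nrow(df), mean=df$time, sd=0.05*df$time)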
    
    # Output:
       ID time         y
    1   1    1  3.438611
    2   1    2  2.350953
    3   1    3  4.379443
    4   1    4  5.823339
    5   2    1  3.470909
    6   2    2  3.607005
    7   2    3  6.447756
    8   2    4  6.150432
    9   3    1  6.608619
    10  3    2  4.740341
    11  3    3  7.670543
    12  3    4 10.215574
    
    

    Note that the distribution of y depends on both ID and time. That contrasts with your example table, where H appears to depend on ID alone: each ID gets a single normal distribution that is the same at every time point.
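
To make the contrast concrete, here is a minimal sketch of both variants on the same 3-ID, 4-period layout. It rests on some assumptions that are not in the question: the values in N(2.3) and N(0.1) are read as per-ID means, the third mean and the common sd of 1 are made up for illustration, and the names y_direct, h and id_means are hypothetical.

```r
set.seed(42)

# Same layout as above: 3 IDs, each observed at 4 time points
df <- data.frame(
  ID   = rep(1:3, each = 4),
  time = rep(1:4, times = 3)
)

# (a) Depends on ID and time, drawn in a single call.
# Independent normals add their means and their variances, so the sum of
# N(ID, (1 + 0.1*ID)^2) and N(time, (0.05*time)^2) is
# N(ID + time, (1 + 0.1*ID)^2 + (0.05*time)^2).
df$y_direct <- rnorm(
  nrow(df),
  mean = df$ID + df$time,
  sd   = sqrt((1 + 0.1 * df$ID)^2 + (0.05 * df$time)^2)
)

# (b) Depends on ID only, as in the question's table.
# Assumed means: 2.3 and 0.1 are taken from the question, 1.0 is invented.
id_means <- c(2.3, 0.1, 1.0)
df$h <- rnorm(nrow(df), mean = id_means[df$ID], sd = 1)  # sd = 1 is an assumption

df
```

y_direct follows the same distribution as the two-draw sum, but the individual values will differ because the random number stream is consumed differently. In variant (b), indexing id_means by df$ID gives every row of an ID the same mean, so time plays no role.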