Search code examples
rdataframedatedplyrmissing-data

How do you populate missing dates for lag?


Say that I have a dataset.

date <- c("2004-02-01", "2004-03-05", "2004-08-09", "2004-08-13", "2004-10-20", "2004-11-02", "2008-01-05", "2008-02-03", "2008-08-09", "2008-11-04", "2012-01-05", "2012-02-03", "2012-08-09", "2012-10-04", "2012-10-04", "2012-10-31", "2012-11-04")

date <- ymd(date)

name <- c("Joe", "Joe", "Joe", "Joe", "Joe", "Joe",
          "Larry", "Larry", "Larry", "Larry",
          "Jeff", "Jeff", "Jeff", "Jeff", "Jeff", "Jeff", "Jeff")

hits <- c(5, 4, 10, 9, 15, 1,
          13, 22, 9, 11,
          15, 17, 10, 3, 4, 2, 33)

df <- data.frame(date, name, hits)

I want to do 7-day time lags for each observation. In order to do this, I will have to restructure the dataset a bit.

I want to add seven days after each date for each name, but the hits will be 0. I hope to end up with a dataset like the following (for Joe):

date  name hits
2004-02-01   Joe    5
2004-02-02   Joe    0
2004-02-03   Joe    0
2004-02-04   Joe    0
2004-02-05   Joe    0
2004-02-06   Joe    0
2004-02-07   Joe    0
2004-02-08   Joe    0
2004-03-05   Joe    4
2004-03-06   Joe    0
2004-03-07   Joe    0
2004-03-08   Joe    0
2004-03-09   Joe    0
2004-03-10   Joe    0
2004-03-11   Joe    0
2004-03-12   Joe    0
2004-08-09   Joe   10
2004-08-10   Joe   0
2004-08-11   Joe   0
2004-08-12   Joe   0
2004-08-13   Joe   9
2004-08-14   Joe   0
2004-08-15   Joe   0
2004-08-16   Joe   0
2004-08-17   Joe   0
2004-08-18   Joe   0
2004-08-19   Joe   0
2004-08-20   Joe   0
2004-10-20   Joe    15
2004-10-21   Joe    0
2004-10-22   Joe    0
2004-10-23   Joe    0
2004-10-24   Joe    0
2004-10-25   Joe    0
2004-10-26   Joe    0
2004-10-27   Joe    0
2004-11-02   Joe    1
2004-11-03   Joe    0
2004-11-04   Joe    0
2004-11-05   Joe    0
2004-11-06   Joe    0
2004-11-07   Joe    0
2004-11-08   Joe    0
2004-11-09   Joe    0

Is there a fast way to do this using dplyr?


Solution

  • Update for dplyr 1.1.1

    Use reframe instead of summarise to return multiple rows per group:

    library(tidyverse)
    
    ndays=7
    
    df.filled2 = df %>% 
      mutate(date = as.Date(date)) %>% 
      group_by(name) %>% 
      reframe(date = seq(min(date), max(date) + 7, "1 day")) %>% 
      left_join(df) %>%
      mutate(hits=replace_na(hits, 0))
    

    Original answer

    The dplyr::summarise function can be used to add rows for the next 7 days for each combination of name and date:

    library(tidyverse)
    
    ndays=7
    
    df.filled = df %>% 
      mutate(date = as.Date(date)) %>% 
      arrange(name, date) %>% 
      group_by(name, date, hits) %>%
      summarise(date = date + 0:ndays, 
                hits = c(hits, rep(0, ndays))) %>% 
      ungroup()
    
    df.filled %>% filter(name=="Joe") %>% print(n=Inf)
    #> # A tibble: 48 × 3
    #>    name  date        hits
    #>    <chr> <date>     <dbl>
    #>  1 Joe   2004-02-01     5
    #>  2 Joe   2004-02-02     0
    #>  3 Joe   2004-02-03     0
    #>  4 Joe   2004-02-04     0
    #>  5 Joe   2004-02-05     0
    #>  6 Joe   2004-02-06     0
    #>  7 Joe   2004-02-07     0
    #>  8 Joe   2004-02-08     0
    #>  9 Joe   2004-03-05     4
    #> 10 Joe   2004-03-06     0
    #> 11 Joe   2004-03-07     0
    #> 12 Joe   2004-03-08     0
    #> 13 Joe   2004-03-09     0
    #> 14 Joe   2004-03-10     0
    #> 15 Joe   2004-03-11     0
    #> 16 Joe   2004-03-12     0
    #> 17 Joe   2004-08-09    10
    #> 18 Joe   2004-08-10     0
    #> 19 Joe   2004-08-11     0
    #> 20 Joe   2004-08-12     0
    #> 21 Joe   2004-08-13     0
    #> 22 Joe   2004-08-14     0
    #> 23 Joe   2004-08-15     0
    #> 24 Joe   2004-08-16     0
    #> 25 Joe   2004-08-13     9
    #> 26 Joe   2004-08-14     0
    #> 27 Joe   2004-08-15     0
    #> 28 Joe   2004-08-16     0
    #> 29 Joe   2004-08-17     0
    #> 30 Joe   2004-08-18     0
    #> 31 Joe   2004-08-19     0
    #> 32 Joe   2004-08-20     0
    #> 33 Joe   2004-10-20    15
    #> 34 Joe   2004-10-21     0
    #> 35 Joe   2004-10-22     0
    #> 36 Joe   2004-10-23     0
    #> 37 Joe   2004-10-24     0
    #> 38 Joe   2004-10-25     0
    #> 39 Joe   2004-10-26     0
    #> 40 Joe   2004-10-27     0
    #> 41 Joe   2004-11-02     1
    #> 42 Joe   2004-11-03     0
    #> 43 Joe   2004-11-04     0
    #> 44 Joe   2004-11-05     0
    #> 45 Joe   2004-11-06     0
    #> 46 Joe   2004-11-07     0
    #> 47 Joe   2004-11-08     0
    #> 48 Joe   2004-11-09     0
    

    Note, however, that with the code above you could end up with repeated dates if a given name has two dates that are less than 7 days apart. Thus, it's probably safer to do the following: In the code below, we fill in every date from the first to the last + 7 days for each name. Then we join that back to the original data to populate the dates that have non-zero hits.

    df$date = as.Date(df$date)
    
    df.filled2 = df %>% 
      group_by(name) %>% 
      summarise(date = seq(min(date), max(date)+7,"1 day")) %>% 
      left_join(df) %>%
      mutate(hits=replace_na(hits, 0))
    
    df.filled2 %>% filter(name=="Joe") %>% print(n=Inf)
    #> # A tibble: 283 × 3
    #> # Groups:   name [1]
    #>     name  date        hits
    #>     <chr> <date>     <dbl>
    #>   1 Joe   2004-02-01     5
    #>   2 Joe   2004-02-02     0
    #>   3 Joe   2004-02-03     0
    #>   4 Joe   2004-02-04     0
    #>   5 Joe   2004-02-05     0
    #>   6 Joe   2004-02-06     0
    #>   7 Joe   2004-02-07     0
    #>   8 Joe   2004-02-08     0
    #>   9 Joe   2004-02-09     0
    #>  10 Joe   2004-02-10     0
    #>  11 Joe   2004-02-11     0
    #>  12 Joe   2004-02-12     0
    #>  13 Joe   2004-02-13     0
    #>  14 Joe   2004-02-14     0
    #>  15 Joe   2004-02-15     0
    #>  16 Joe   2004-02-16     0
    #>  17 Joe   2004-02-17     0
    #>  18 Joe   2004-02-18     0
    #>  19 Joe   2004-02-19     0
    #>  20 Joe   2004-02-20     0
    #>  21 Joe   2004-02-21     0
    #>  22 Joe   2004-02-22     0
    #>  23 Joe   2004-02-23     0
    #>  24 Joe   2004-02-24     0
    #>  25 Joe   2004-02-25     0
    #>  26 Joe   2004-02-26     0
    #>  27 Joe   2004-02-27     0
    #>  28 Joe   2004-02-28     0
    #>  29 Joe   2004-02-29     0
    #>  30 Joe   2004-03-01     0
    #>  31 Joe   2004-03-02     0
    #>  32 Joe   2004-03-03     0
    #>  33 Joe   2004-03-04     0
    #>  34 Joe   2004-03-05     4
    #>  35 Joe   2004-03-06     0
    #>  36 Joe   2004-03-07     0
    #>  37 Joe   2004-03-08     0
    #>  38 Joe   2004-03-09     0
    #>  39 Joe   2004-03-10     0
    #>  40 Joe   2004-03-11     0
    #>  41 Joe   2004-03-12     0
    #>  42 Joe   2004-03-13     0
    #>  43 Joe   2004-03-14     0
    #>  44 Joe   2004-03-15     0
    #>  45 Joe   2004-03-16     0
    #>  46 Joe   2004-03-17     0
    #>  47 Joe   2004-03-18     0
    #>  48 Joe   2004-03-19     0
    #>  49 Joe   2004-03-20     0
    #>  50 Joe   2004-03-21     0
    #>  51 Joe   2004-03-22     0
    #>  52 Joe   2004-03-23     0
    #>  53 Joe   2004-03-24     0
    #>  54 Joe   2004-03-25     0
    #>  55 Joe   2004-03-26     0
    #>  56 Joe   2004-03-27     0
    #>  57 Joe   2004-03-28     0
    #>  58 Joe   2004-03-29     0
    #>  59 Joe   2004-03-30     0
    #>  60 Joe   2004-03-31     0
    #>  61 Joe   2004-04-01     0
    #>  62 Joe   2004-04-02     0
    #>  63 Joe   2004-04-03     0
    #>  64 Joe   2004-04-04     0
    #>  65 Joe   2004-04-05     0
    #>  66 Joe   2004-04-06     0
    #>  67 Joe   2004-04-07     0
    #>  68 Joe   2004-04-08     0
    #>  69 Joe   2004-04-09     0
    #>  70 Joe   2004-04-10     0
    #>  71 Joe   2004-04-11     0
    #>  72 Joe   2004-04-12     0
    #>  73 Joe   2004-04-13     0
    #>  74 Joe   2004-04-14     0
    #>  75 Joe   2004-04-15     0
    #>  76 Joe   2004-04-16     0
    #>  77 Joe   2004-04-17     0
    #>  78 Joe   2004-04-18     0
    #>  79 Joe   2004-04-19     0
    #>  80 Joe   2004-04-20     0
    #>  81 Joe   2004-04-21     0
    #>  82 Joe   2004-04-22     0
    #>  83 Joe   2004-04-23     0
    #>  84 Joe   2004-04-24     0
    #>  85 Joe   2004-04-25     0
    #>  86 Joe   2004-04-26     0
    #>  87 Joe   2004-04-27     0
    #>  88 Joe   2004-04-28     0
    #>  89 Joe   2004-04-29     0
    #>  90 Joe   2004-04-30     0
    #>  91 Joe   2004-05-01     0
    #>  92 Joe   2004-05-02     0
    #>  93 Joe   2004-05-03     0
    #>  94 Joe   2004-05-04     0
    #>  95 Joe   2004-05-05     0
    #>  96 Joe   2004-05-06     0
    #>  97 Joe   2004-05-07     0
    #>  98 Joe   2004-05-08     0
    #>  99 Joe   2004-05-09     0
    #> 100 Joe   2004-05-10     0
    #> 101 Joe   2004-05-11     0
    #> 102 Joe   2004-05-12     0
    #> 103 Joe   2004-05-13     0
    #> 104 Joe   2004-05-14     0
    #> 105 Joe   2004-05-15     0
    #> 106 Joe   2004-05-16     0
    #> 107 Joe   2004-05-17     0
    #> 108 Joe   2004-05-18     0
    #> 109 Joe   2004-05-19     0
    #> 110 Joe   2004-05-20     0
    #> 111 Joe   2004-05-21     0
    #> 112 Joe   2004-05-22     0
    #> 113 Joe   2004-05-23     0
    #> 114 Joe   2004-05-24     0
    #> 115 Joe   2004-05-25     0
    #> 116 Joe   2004-05-26     0
    #> 117 Joe   2004-05-27     0
    #> 118 Joe   2004-05-28     0
    #> 119 Joe   2004-05-29     0
    #> 120 Joe   2004-05-30     0
    #> 121 Joe   2004-05-31     0
    #> 122 Joe   2004-06-01     0
    #> 123 Joe   2004-06-02     0
    #> 124 Joe   2004-06-03     0
    #> 125 Joe   2004-06-04     0
    #> 126 Joe   2004-06-05     0
    #> 127 Joe   2004-06-06     0
    #> 128 Joe   2004-06-07     0
    #> 129 Joe   2004-06-08     0
    #> 130 Joe   2004-06-09     0
    #> 131 Joe   2004-06-10     0
    #> 132 Joe   2004-06-11     0
    #> 133 Joe   2004-06-12     0
    #> 134 Joe   2004-06-13     0
    #> 135 Joe   2004-06-14     0
    #> 136 Joe   2004-06-15     0
    #> 137 Joe   2004-06-16     0
    #> 138 Joe   2004-06-17     0
    #> 139 Joe   2004-06-18     0
    #> 140 Joe   2004-06-19     0
    #> 141 Joe   2004-06-20     0
    #> 142 Joe   2004-06-21     0
    #> 143 Joe   2004-06-22     0
    #> 144 Joe   2004-06-23     0
    #> 145 Joe   2004-06-24     0
    #> 146 Joe   2004-06-25     0
    #> 147 Joe   2004-06-26     0
    #> 148 Joe   2004-06-27     0
    #> 149 Joe   2004-06-28     0
    #> 150 Joe   2004-06-29     0
    #> 151 Joe   2004-06-30     0
    #> 152 Joe   2004-07-01     0
    #> 153 Joe   2004-07-02     0
    #> 154 Joe   2004-07-03     0
    #> 155 Joe   2004-07-04     0
    #> 156 Joe   2004-07-05     0
    #> 157 Joe   2004-07-06     0
    #> 158 Joe   2004-07-07     0
    #> 159 Joe   2004-07-08     0
    #> 160 Joe   2004-07-09     0
    #> 161 Joe   2004-07-10     0
    #> 162 Joe   2004-07-11     0
    #> 163 Joe   2004-07-12     0
    #> 164 Joe   2004-07-13     0
    #> 165 Joe   2004-07-14     0
    #> 166 Joe   2004-07-15     0
    #> 167 Joe   2004-07-16     0
    #> 168 Joe   2004-07-17     0
    #> 169 Joe   2004-07-18     0
    #> 170 Joe   2004-07-19     0
    #> 171 Joe   2004-07-20     0
    #> 172 Joe   2004-07-21     0
    #> 173 Joe   2004-07-22     0
    #> 174 Joe   2004-07-23     0
    #> 175 Joe   2004-07-24     0
    #> 176 Joe   2004-07-25     0
    #> 177 Joe   2004-07-26     0
    #> 178 Joe   2004-07-27     0
    #> 179 Joe   2004-07-28     0
    #> 180 Joe   2004-07-29     0
    #> 181 Joe   2004-07-30     0
    #> 182 Joe   2004-07-31     0
    #> 183 Joe   2004-08-01     0
    #> 184 Joe   2004-08-02     0
    #> 185 Joe   2004-08-03     0
    #> 186 Joe   2004-08-04     0
    #> 187 Joe   2004-08-05     0
    #> 188 Joe   2004-08-06     0
    #> 189 Joe   2004-08-07     0
    #> 190 Joe   2004-08-08     0
    #> 191 Joe   2004-08-09    10
    #> 192 Joe   2004-08-10     0
    #> 193 Joe   2004-08-11     0
    #> 194 Joe   2004-08-12     0
    #> 195 Joe   2004-08-13     9
    #> 196 Joe   2004-08-14     0
    #> 197 Joe   2004-08-15     0
    #> 198 Joe   2004-08-16     0
    #> 199 Joe   2004-08-17     0
    #> 200 Joe   2004-08-18     0
    #> 201 Joe   2004-08-19     0
    #> 202 Joe   2004-08-20     0
    #> 203 Joe   2004-08-21     0
    #> 204 Joe   2004-08-22     0
    #> 205 Joe   2004-08-23     0
    #> 206 Joe   2004-08-24     0
    #> 207 Joe   2004-08-25     0
    #> 208 Joe   2004-08-26     0
    #> 209 Joe   2004-08-27     0
    #> 210 Joe   2004-08-28     0
    #> 211 Joe   2004-08-29     0
    #> 212 Joe   2004-08-30     0
    #> 213 Joe   2004-08-31     0
    #> 214 Joe   2004-09-01     0
    #> 215 Joe   2004-09-02     0
    #> 216 Joe   2004-09-03     0
    #> 217 Joe   2004-09-04     0
    #> 218 Joe   2004-09-05     0
    #> 219 Joe   2004-09-06     0
    #> 220 Joe   2004-09-07     0
    #> 221 Joe   2004-09-08     0
    #> 222 Joe   2004-09-09     0
    #> 223 Joe   2004-09-10     0
    #> 224 Joe   2004-09-11     0
    #> 225 Joe   2004-09-12     0
    #> 226 Joe   2004-09-13     0
    #> 227 Joe   2004-09-14     0
    #> 228 Joe   2004-09-15     0
    #> 229 Joe   2004-09-16     0
    #> 230 Joe   2004-09-17     0
    #> 231 Joe   2004-09-18     0
    #> 232 Joe   2004-09-19     0
    #> 233 Joe   2004-09-20     0
    #> 234 Joe   2004-09-21     0
    #> 235 Joe   2004-09-22     0
    #> 236 Joe   2004-09-23     0
    #> 237 Joe   2004-09-24     0
    #> 238 Joe   2004-09-25     0
    #> 239 Joe   2004-09-26     0
    #> 240 Joe   2004-09-27     0
    #> 241 Joe   2004-09-28     0
    #> 242 Joe   2004-09-29     0
    #> 243 Joe   2004-09-30     0
    #> 244 Joe   2004-10-01     0
    #> 245 Joe   2004-10-02     0
    #> 246 Joe   2004-10-03     0
    #> 247 Joe   2004-10-04     0
    #> 248 Joe   2004-10-05     0
    #> 249 Joe   2004-10-06     0
    #> 250 Joe   2004-10-07     0
    #> 251 Joe   2004-10-08     0
    #> 252 Joe   2004-10-09     0
    #> 253 Joe   2004-10-10     0
    #> 254 Joe   2004-10-11     0
    #> 255 Joe   2004-10-12     0
    #> 256 Joe   2004-10-13     0
    #> 257 Joe   2004-10-14     0
    #> 258 Joe   2004-10-15     0
    #> 259 Joe   2004-10-16     0
    #> 260 Joe   2004-10-17     0
    #> 261 Joe   2004-10-18     0
    #> 262 Joe   2004-10-19     0
    #> 263 Joe   2004-10-20    15
    #> 264 Joe   2004-10-21     0
    #> 265 Joe   2004-10-22     0
    #> 266 Joe   2004-10-23     0
    #> 267 Joe   2004-10-24     0
    #> 268 Joe   2004-10-25     0
    #> 269 Joe   2004-10-26     0
    #> 270 Joe   2004-10-27     0
    #> 271 Joe   2004-10-28     0
    #> 272 Joe   2004-10-29     0
    #> 273 Joe   2004-10-30     0
    #> 274 Joe   2004-10-31     0
    #> 275 Joe   2004-11-01     0
    #> 276 Joe   2004-11-02     1
    #> 277 Joe   2004-11-03     0
    #> 278 Joe   2004-11-04     0
    #> 279 Joe   2004-11-05     0
    #> 280 Joe   2004-11-06     0
    #> 281 Joe   2004-11-07     0
    #> 282 Joe   2004-11-08     0
    #> 283 Joe   2004-11-09     0
    

    The second approach will in general result in many more rows of data. If you want to keep a maximum of 7 rows after any date with non-zero hits, you can do the following:

    df.filled2 = df.filled2 %>% 
      group_by(name) %>% 
      mutate(test=cumsum(hits > 0)) %>% 
      group_by(name, test) %>% 
      slice(1:8) %>%
      ungroup %>% 
      select(-test)