Search code examples
rdplyrfill

Fill dates by groups in a data frame R


I wanna go from this:

name code date usage result
Jennifer Aniston 23211 2021-11-04 345 1
Jennifer Aniston 23211 2021-11-05 260 1
Jennifer Aniston 23211 2021-11-06 230 0
Jennifer Aniston 23211 2021-11-07 0 0
Matthew Perry 44215 2022-10-01 312 1
Matthew Perry 44215 2022-10-04 230 0
Matthew Perry 44215 2021-10-05 232 0
Lisa Kudrow 55120 2022-01-01 132 0
Lisa Kudrow 55120 2022-01-02 125 0
Lisa Kudrow 55120 2022-01-04 345 1
Lisa Kudrow 55120 2022-01-06 321 1
Lisa Kudrow 55120 2022-01-07 431 1

(note: for Jennifer Aniston we have all the dates from the minimum date that appears and the maximum date, but for Matthew Perry we have missing dates, it starts at 2022-10-01 but we don't have the 2 and the 3 of october. The same happens with Lisa Kudrow, it starts at 1 january but we miss 3 and 5 of january)

to this:

name code date usage result
Jennifer Aniston 23211 2021-11-04 345 1
Jennifer Aniston 23211 2021-11-05 260 1
Jennifer Aniston 23211 2021-11-06 230 0
Jennifer Aniston 23211 2021-11-07 0 0
Matthew Perry 44215 2022-10-01 312 1
Matthew Perry 44215 2022-10-02 NA NA
Matthew Perry 44215 2022-10-03 NA NA
Matthew Perry 44215 2022-10-04 230 0
Matthew Perry 44215 2021-10-05 232 0
Lisa Kudrow 55120 2022-01-01 132 0
Lisa Kudrow 55120 2022-01-02 125 0
Lisa Kudrow 55120 2022-01-03 NA NA
Lisa Kudrow 55120 2022-01-04 345 1
Lisa Kudrow 55120 2022-01-05 NA NA
Lisa Kudrow 55120 2022-01-06 321 1
Lisa Kudrow 55120 2022-01-07 431 1

So now we have all the dates, filling with an NA where we don't have data available and with the name and the code for the person.

Any idea of howing this in R? (preferibly using dplyr and pipes)


Solution

  • Assuming that your dates are actually in date format rather than character format (we can't tell from the table in the question), and assuming that the 7th row has the wrong year in it (2021 as opposed to 2022), you can do:

    library(tidyverse)
    
    df %>% 
      split(.$name) %>%
      lapply(function(x) {
        complete(x, expand(x, date = seq(min(x$date), max(x$date), by = 'day')),
                 fill = list(name = x$name[1], code = x$code[1]))}) %>%
      bind_rows()
    #> # A tibble: 16 x 5
    #>    date       name              code usage result
    #>    <date>     <chr>            <int> <int>  <int>
    #>  1 2021-11-04 Jennifer Aniston 23211   345      1
    #>  2 2021-11-05 Jennifer Aniston 23211   260      1
    #>  3 2021-11-06 Jennifer Aniston 23211   230      0
    #>  4 2021-11-07 Jennifer Aniston 23211     0      0
    #>  5 2022-01-01 Lisa Kudrow      55120   132      0
    #>  6 2022-01-02 Lisa Kudrow      55120   125      0
    #>  7 2022-01-03 Lisa Kudrow      55120    NA     NA
    #>  8 2022-01-04 Lisa Kudrow      55120   345      1
    #>  9 2022-01-05 Lisa Kudrow      55120    NA     NA
    #> 10 2022-01-06 Lisa Kudrow      55120   321      1
    #> 11 2022-01-07 Lisa Kudrow      55120   431      1
    #> 12 2022-10-01 Matthew Perry    44215   312      1
    #> 13 2022-10-02 Matthew Perry    44215    NA     NA
    #> 14 2022-10-03 Matthew Perry    44215    NA     NA
    #> 15 2022-10-04 Matthew Perry    44215   230      0
    #> 16 2022-10-05 Matthew Perry    44215   232      0
    

    Created on 2022-06-02 by the reprex package (v2.0.1)


    Data taken from question in reproducible format

    df <- structure(list(name = c("Jennifer Aniston", "Jennifer Aniston", 
    "Jennifer Aniston", "Jennifer Aniston", "Matthew Perry", "Matthew Perry", 
    "Matthew Perry", "Lisa Kudrow", "Lisa Kudrow", "Lisa Kudrow", 
    "Lisa Kudrow", "Lisa Kudrow"), code = c(23211L, 23211L, 23211L, 
    23211L, 44215L, 44215L, 44215L, 55120L, 55120L, 55120L, 55120L, 
    55120L), date = structure(c(18935, 18936, 18937, 18938, 19266, 
    19269, 19270, 18993, 18994, 18996, 18998, 18999), class = "Date"), 
        usage = c(345L, 260L, 230L, 0L, 312L, 230L, 232L, 132L, 125L, 
        345L, 321L, 431L), result = c(1L, 1L, 0L, 0L, 1L, 0L, 0L, 
        0L, 0L, 1L, 1L, 1L)), row.names = c(NA, -12L), class = "data.frame")