Search code examples
rsequences

Fill in Missing Weekly Data Points in R as 0 for Each ID


I have data in this shape:

> head(posts)
     id    week_number num_posts
1 UKL1.1           1         4
2 UKL1.1           6         9
3 UKL1.2           1         2
4 UKL1.3           1         8
5 UKL1.3           2         7
6 UKL1.3           3         3

and I want to make it such that each id has a row for each week_number (1,2,3,4,5,6) and if that week_number isn't already in the data then posts should = 0

I've seen this done using the package zoo with true time-series data, but without creating a proper POSIXct or Date version of week_number and using that package is there a way to do this directly?


Solution

  • Here's a way using data.table.

    library(data.table)
    setDT(posts)                           # convert posts to a data.table
    all.wks <- posts[,list(week_number=min(week_number):max(week_number)),by=id]
    setkey(posts,id,week_number)           # index on id and week number
    setkey(all.wks,id,week_number)         # index on id and week number
    result <- posts[all.wks]               # data.table join is very fast
    result[is.na(num_posts),num_posts:=0]  # convert NA to 0
    result
    #         id week_number num_posts
    #  1: UKL1.1           1         4
    #  2: UKL1.1           2         0
    #  3: UKL1.1           3         0
    #  4: UKL1.1           4         0
    #  5: UKL1.1           5         0
    #  6: UKL1.1           6         9
    #  7: UKL1.2           1         2
    #  8: UKL1.3           1         8
    #  9: UKL1.3           2         7
    # 10: UKL1.3           3         3
    

    Another way:

    my_fun <- function(x) {
        weeks = with(x, min(week_number):max(week_number))
        posts = with(x, num_posts[match(weeks, week_number)])
        list(week_number=weeks, num_posts=posts)
    }
    setDT(posts)[, my_fun(.SD), by=id]
    

    .SD means subset of data; it contains the data subset corresponding to each group specified in by, with all columns excluding the grouping column = id.

    Then you can replace NAs as shown above.