Search code examples
rdata.tabletidyverserunner

Calculating running amounts using data.table does not count days with no values?


I have two dataframes with values for two different years, categorized by group, and I'd like to combine them together to calculate running amounts per group while filling in missing dates. Here are the example dataframes I'm working with.

df1 <- data.frame(
  group = sample(c('g1', 'g2', 'g3', 'g4'), 365, replace = TRUE),
  date = sample(seq(as.Date('2023-01-01'), as.Date('2023-12-31'), by = "day"), 365),
  values = rnorm(365, 10, 1)
)

df2 <- data.frame(
  group = sample(c('g1', 'g2', 'g3', 'g4'), 365, replace = TRUE),
  date = sample(seq(as.Date('2023-01-01'), as.Date('2023-12-31'), by = "day"), 365),
  values = rnorm(365, 10, 1)
)

This create two random dataframes with a random value for each date, but for only one group, so there are missing dates. Below, I use data.table functions to combine the two dataframes together, fill in the missing dates, and calculate the mean average running amounts per group using the runner package. I noticed, though, that dates with values of 0 are not counted against the mean. For example, the resulting data frame from this code:

df <- rbindlist(list(df1, df2)) 

idx <- df[,.(date = seq(min(date), max(date), "day")), by = group]
setkey(df, group, date)
setkey(idx, group, date)

df <- df[idx] %>%
  setorder(date) %>%
  .[,.(
    date = lubridate::ymd(date),
    values = ifelse(is.na(values), 0, as.numeric(values)),
    running_values = runner::mean_run(x = values, k = 90, idx = date)
  ), by = group]

results in a dataframe that looks like this:

group date values running_values
g1 2023-01-04 9.412 9.412
g1 2023-01-05 0 9.412
g1 2023-01-06 0 9.412
g1 2023-01-07 0 9.412
g1 2023-01-08 10.788 10.100
g1 2023-01-09 0 10.100

However, the output that I expect (using tidyverse functions instead) is:

df <- bind_rows(df1, df2) %>%
  complete(
    nesting(group),
    date = seq.Date(min(date), max(date), by = "day")
  ) %>%
  group_by(group) %>%
  arrange(date) %>%
  mutate(
    values = ifelse(is.na(values), 0, as.numeric(values)),
    date = lubridate::ymd(date),
    running_values = runner::mean_run(x = values, k = 90, idx = date)
  )

This gives the dataframe I want.

group date values running_values
g1 2023-01-01 0 0
g1 2023-01-02 0 0
g1 2023-01-03 0 0
g1 2023-01-04 9.412 2.353
g1 2023-01-05 0 1.882
g1 2023-01-06 0 1.568

Using the tidyverse method will also make the groups start at the earliest date across all groups, while the data.table will only use the earliest date per each group, though it's not my main concern. How do I reproduce the tidyverse table but using data.table functions so I don't have to load those dependencies? A simple solution would seem to be to convert the NAs to arbitrarily small numbers (0.000000001) which seems to work well enough, but I'd like to understand why this difference occurs at all.

df1 <- data.frame(
  group = sample(c('g1', 'g2', 'g3', 'g4'), 365, replace = TRUE),
  date = sample(seq(as.Date('2023-01-01'), as.Date('2023-12-31'), by = "day"), 365),
  values = rnorm(365, 10, 1)
)

df2 <- data.frame(
  group = sample(c('g1', 'g2', 'g3', 'g4'), 365, replace = TRUE),
  date = sample(seq(as.Date('2023-01-01'), as.Date('2023-12-31'), by = "day"), 365),
  values = rnorm(365, 10, 1)
)

# Bad output (data.table method)
df <- rbindlist(list(df1, df2)) 

idx <- df[,.(date = seq(min(date), max(date), "day")), by = group]
setkey(df, group, date)
setkey(idx, group, date)

df <- df[idx] %>%
  setorder(date) %>%
  .[,.(
    date = lubridate::ymd(date),
    values = ifelse(is.na(values), 0, as.numeric(values)),
    running_values = runner::mean_run(x = values, k = 90, idx = date)
  ), by = group]

# Desired output (tidyverse method)
df <- bind_rows(df1, df2) %>%
  complete(
    nesting(group),
    date = seq.Date(min(date), max(date), by = "day")
  ) %>%
  group_by(group) %>%
  arrange(date) %>%
  mutate(
    values = ifelse(is.na(values), 0, as.numeric(values)),
    date = lubridate::ymd(date),
    running_values = runner::mean_run(x = values, k = 90, idx = date)
  )

Solution

  • You can use the cross-product function CJ() to create a filled panel (every day for each group) and right-join it to the data:

    df <- rbind(df1,df2)[
      CJ(group=unique(df$group), date=seq(min(df$date), max(df$date), "day")),
      on=.(group, date)
      ][is.na(values), values:=0]
    

    Then do the calculation:

    df[, running_values := runner::mean_run(x = values, k = 90, idx = date), by=.(group)]
    

    Edit - adding an explanation of why the original attempt didn't work:

    First, as an aside, your example data has multiple values per group per day in some cases, so we should probably be aggregating to max one row per date/group:

    rbind(df1,df2)[, .(value=sum(values)), keyby=.(group, date)]
    

    (NB, rbindlist() is intended for large lists of unknown length - rbind will despatch to rbind.data.table, so is all you need)

    This will give 4 * 365 = 1460 rows after joining with the filled panel frame.

    As for the other issues, setkey just orders the data (and tags it as ordered) - it isn't the issue. (And you don't actually need to key df here - see below.)

    The issues are:

    idx <- df[,.(date = seq(min(date), max(date), "day")), by = group]
    

    This explicitly does what you don't want - it creates a series by group from the group min to the group max. This is why you need CJ() (or else replace with seq(min(df$date), max(df$date), "day")).

    There are a lot of odd things next, which must be the result of you trying different things to see what happens. You set keys on both the df and idx tables (even though they are both already in the right order thanks to the join with idx, and you don't even use the idx table again). But then you immediately undo this by putting df in the wrong order:

    df |> setorder(date) # now cycles g1,g2,...,g4 by date
    

    Next:

    date is already a date so you don't need to convert it to one with lubridate().

    You convert NA to 0 in values and give the output the same name values - this bit is okay.

    But you create running_values in the same step in the pipe, which means that it will be based on the original values (with NAs). You need to do this in a second step instead.

    In short the only steps you need are:

    • join with a full panel - everything is now in the right order for runner
    • convert NAs to 0's in what results
    • create the new column on the converted data