Search code examples
rdatecutbins

R: Cutting a year of dates into 2 month bins yields 7 bins instead of 6?


I am trying to use the cut() function in R to divide a year of dates into 6 two month bins. When I do, it makes 7 bins instead of 6, with the last bin being empty. I am using the following code:

dates <- seq(as.Date("2021-1-1"),as.Date("2021-12-31"),by="day")
months <- cut(dates,"month",labels=1:12)
table(months)
# months
#  1  2  3  4  5  6  7  8  9 10 11 12 
# 31 28 31 30 31 30 31 31 30 31 30 31 
sextiles <- cut(dates,"2 months",labels=1:6)
# Error in cut.default(unclass(x), unclass(breaks), labels = labels, right = right,  : 
#   lengths of 'breaks' and 'labels' differ
sextiles <- cut(dates,"2 months",labels=1:7)
table(sextiles)
# sextiles
#  1  2  3  4  5  6  7 
# 59 61 61 62 61 61  0 

The code works fine when I divide the year into single month bins, but produces an error when I divide into 2 month bins, unless I account for 7 bins instead of 6 in the labels argument. If I start removing dates from the end of the year, the code eventually works with 6 bins after removing the last 3 days of the year:

dates_364 <- dates[-length(dates)]
sextiles <- cut(dates_364,"2 months",labels=1:6)
# Error in cut.default(unclass(x), unclass(breaks), labels = labels, right = right,  : 
#   lengths of 'breaks' and 'labels' differ
dates_363 <- dates_364[-length((dates_364))]
sextiles <- cut(dates_363,"2 months",labels=1:6)
# Error in cut.default(unclass(x), unclass(breaks), labels = labels, right = right,  : 
#   lengths of 'breaks' and 'labels' differ
dates_362 <- dates_363[-length((dates_363))]
sextiles <- cut(dates_362,"2 months",labels=1:6)
table(sextiles)
# sextiles
#  1  2  3  4  5  6 
# 59 61 61 62 61 58 

This seems like a bug in the function. Can anyone shed any light on something I'm missing? Thanks!


Solution

  • Thanks to insight provided by @r2evans, I found the answer to my question.

    There is a bug in the code for the cut.Date function. Lines 31 through 41 handle the case where breaks are in months:

    if (valid == 3L) {
      start$mday <- 1L
      start$isdst <- -1L
      end <- as.POSIXlt(max(x, na.rm = TRUE))
      step <- if (length(by2) == 2L) 
        as.integer(by2[1L])
      else 1L
      end <- as.POSIXlt(end + (31 * step * 86400))
      end$mday <- 1L
      end$isdst <- -1L
      breaks <- as.Date(seq(start, end, breaks))
    

    Line 38, end <- as.POSIXlt(end + (31 * step * 86400)) adjusts the end ahead by 31 days times the step, or number of months in each bin. Because not all months have 31 days, there are instances where the end gets pushed back far enough to create a superfluous bin. This can easily be corrected with a few lines of code, as we see in the case when the breaks are in quarters. See lines 57 through 75:

    else if (valid == 5L) {
      qtr <- rep(c(0L, 3L, 6L, 9L), each = 3L)
      start$mon <- qtr[start$mon + 1L]
      start$mday <- 1L
      start$isdst <- -1L
      maxx <- max(x, na.rm = TRUE)           # Note this line
      end <- as.POSIXlt(maxx)                # Note this line
      step <- if (length(by2) == 2L) 
        as.integer(by2[1L])
      else 1L
      end <- as.POSIXlt(end + (93 * step * 86400))
      end$mon <- qtr[end$mon + 1L]
      end$mday <- 1L
      end$isdst <- -1L
      breaks <- as.Date(seq(start, end, paste(step * 3L, 
                                              "months")))
      lb <- length(breaks)                   # Note this line
      if (maxx < breaks[lb - 1])             # If extra bin
        breaks <- breaks[-lb]                # Then remove extra bin
    

    If we employ this same method and modify the section of code dealing with breaks="months":

    if (valid == 3L) {
      start$mday <- 1L
      start$isdst <- -1L
      maxx <- max(x, na.rm = TRUE)     # Line added
      end <- as.POSIXlt(maxx)          # Line modified
      step <- if (length(by2) == 2L) 
        as.integer(by2[1L])
      else 1L
      end <- as.POSIXlt(end + (31 * step * 86400))
      end$mday <- 1L
      end$isdst <- -1L
      breaks <- as.Date(seq(start, end, breaks))
      lb <- length(breaks)             # Line added
      if (maxx < breaks[lb - 1])       # Line added
        breaks <- breaks[-lb]          # Line added
    

    Storing the modified function in cut_Date, we get the following:

    dates <- seq(as.Date("2021-1-1"),as.Date("2021-12-31"),by="day")
    sextiles <- cut(dates,"2 months",labels=1:6)
    # Error in cut.default(unclass(x), unclass(breaks), labels = labels, right = right,  : 
    #   lengths of 'breaks' and 'labels' differ
    sextiles <- cut_Date(dates,"2 months",labels=1:6)
    table(sextiles)
    # sextiles
    #  1  2  3  4  5  6 
    # 59 61 61 62 61 61
    

    Bug fixed!