Search code examples
rdatedplyrlubridatecategorization

categorize based on date ranges in R


How do I categorize each row in a large R dataframe (>2 million rows) based on date range definitions in a separate, much smaller R dataframe (12 rows)?

My large dataframe, captures, looks similar to this when called via head(captures) :

       id       date sex
1  160520 2016-11-22   1
2 1029735 2016-11-12   1
3 1885200 2016-11-05   1
4 2058366 2015-09-26   2
5 2058367 2015-09-26   1
6 2058368 2015-09-26   1

My small dataframe, seasons, looks similar to this in its entirety:

Season Opening.Date Closing.Date
  2016   2016-09-24   2017-01-15
  2015   2015-09-26   2016-01-10
  2014   2014-09-27   2015-01-11
  2013   2013-09-28   2014-01-12
  2012   2012-09-22   2013-01-13
  2011   2011-09-24   2012-01-08
  2010   2010-09-25   2011-01-16
  2009   2009-09-26   2010-01-17
  2008   2008-09-27   2009-01-18
  2007   2007-09-22   2008-01-13
  2006   2006-09-23   2007-01-14
  2005   2005-09-24   2006-01-15 

I need to add a 'season' column to my captures dataframe where the value would be determined based on if and where captures$date falls in the ranges defined in seasons.

Here is a long-hand solution I came up with that isn't working for me because my dataframe is so large.

#add packages
library(dplyr)
library(lubridate)
#make blank column
captures$season=NA
for (i in 1:length(seasons$Season)){
  for (j in 1:length(captures$id{
    captures$season[j]=ifelse(between(captures$date[j],ymd(seasons$Opening.Date[i]),ymd(seasons$Closing.Date[i])),seasons$Season[i],captures$season[j])
  }
}

Again, this doesn't work for me as R crashes every time. I also realize this doesn't take advantage of vectorization in R. Any help here is appreciated!


Solution

  • It would be great indeed if you could do a join operation efficiently based on a range of values instead of equality. Unfortunately, I don't know if a general solution exists. In the time being, I suggest using a single for loop.

    The efficiency of vectorization is best done along the tallest data. That is, if we loop on one data.frame and vectorize the other, it makes more sense to vectorize the longer vector and loop on the shorter ones. With this in mind, we'll loop on the frame of seasons and vectorize the 2M rows of data.

    Your data:

    txt <- "Season Opening.Date Closing.Date
      2016   2016-09-24   2017-01-15
      2015   2015-09-26   2016-01-10
      2014   2014-09-27   2015-01-11
      2013   2013-09-28   2014-01-12
      2012   2012-09-22   2013-01-13
      2011   2011-09-24   2012-01-08
      2010   2010-09-25   2011-01-16
      2009   2009-09-26   2010-01-17
      2008   2008-09-27   2009-01-18
      2007   2007-09-22   2008-01-13
      2006   2006-09-23   2007-01-14
      2005   2005-09-24   2006-01-15"
    seasons <- read.table(text = txt, header = TRUE)
    seasons[2:3] <- lapply(seasons[2:3], as.Date)
    
    txt <- "       id       date sex
    1  160520 2016-11-22   1
    2 1029735 2016-11-12   1
    3 1885200 2016-11-05   1
    4 2058366 2015-09-26   2
    5 2058367 2015-09-26   1
    6 2058368 2015-09-26   1"
    dat <- read.table(text = txt, header = TRUE)
    dat$date <- as.Date(dat$date)
    

    And the start the process, we assume that all data's season is as yet not defined:

    dat$season <- NA
    

    Loop around each of the seasons' rows:

    for (i in seq_len(nrow(seasons))) {
      dat$season <- ifelse(is.na(dat$season) &
                             dat$date >= seasons$Opening.Date[i] &
                             dat$date < seasons$Closing.Date[i],
                           seasons$Season[i], dat$season)                       
    }
    dat
    #        id       date sex season
    # 1  160520 2016-11-22   1   2016
    # 2 1029735 2016-11-12   1   2016
    # 3 1885200 2016-11-05   1   2016
    # 4 2058366 2015-09-26   2   2015
    # 5 2058367 2015-09-26   1   2015
    # 6 2058368 2015-09-26   1   2015