How do I categorize each row in a large R dataframe (>2 million rows) based on date range definitions in a separate, much smaller R dataframe (12 rows)?
My large dataframe, captures, looks similar to this when called via head(captures)
:
id date sex
1 160520 2016-11-22 1
2 1029735 2016-11-12 1
3 1885200 2016-11-05 1
4 2058366 2015-09-26 2
5 2058367 2015-09-26 1
6 2058368 2015-09-26 1
My small dataframe, seasons, looks similar to this in its entirety:
Season Opening.Date Closing.Date
2016 2016-09-24 2017-01-15
2015 2015-09-26 2016-01-10
2014 2014-09-27 2015-01-11
2013 2013-09-28 2014-01-12
2012 2012-09-22 2013-01-13
2011 2011-09-24 2012-01-08
2010 2010-09-25 2011-01-16
2009 2009-09-26 2010-01-17
2008 2008-09-27 2009-01-18
2007 2007-09-22 2008-01-13
2006 2006-09-23 2007-01-14
2005 2005-09-24 2006-01-15
I need to add a 'season' column to my captures dataframe where the value would be determined based on if and where captures$date
falls in the ranges defined in seasons.
Here is a long-hand solution I came up with that isn't working for me because my dataframe is so large.
#add packages
library(dplyr)
library(lubridate)
#make blank column
captures$season=NA
for (i in 1:length(seasons$Season)){
for (j in 1:length(captures$id{
captures$season[j]=ifelse(between(captures$date[j],ymd(seasons$Opening.Date[i]),ymd(seasons$Closing.Date[i])),seasons$Season[i],captures$season[j])
}
}
Again, this doesn't work for me as R crashes every time. I also realize this doesn't take advantage of vectorization in R. Any help here is appreciated!
It would be great indeed if you could do a join
operation efficiently based on a range of values instead of equality. Unfortunately, I don't know if a general solution exists. In the time being, I suggest using a single for
loop.
The efficiency of vectorization is best done along the tallest data. That is, if we loop on one data.frame and vectorize the other, it makes more sense to vectorize the longer vector and loop on the shorter ones. With this in mind, we'll loop on the frame of seasons and vectorize the 2M rows of data.
Your data:
txt <- "Season Opening.Date Closing.Date
2016 2016-09-24 2017-01-15
2015 2015-09-26 2016-01-10
2014 2014-09-27 2015-01-11
2013 2013-09-28 2014-01-12
2012 2012-09-22 2013-01-13
2011 2011-09-24 2012-01-08
2010 2010-09-25 2011-01-16
2009 2009-09-26 2010-01-17
2008 2008-09-27 2009-01-18
2007 2007-09-22 2008-01-13
2006 2006-09-23 2007-01-14
2005 2005-09-24 2006-01-15"
seasons <- read.table(text = txt, header = TRUE)
seasons[2:3] <- lapply(seasons[2:3], as.Date)
txt <- " id date sex
1 160520 2016-11-22 1
2 1029735 2016-11-12 1
3 1885200 2016-11-05 1
4 2058366 2015-09-26 2
5 2058367 2015-09-26 1
6 2058368 2015-09-26 1"
dat <- read.table(text = txt, header = TRUE)
dat$date <- as.Date(dat$date)
And the start the process, we assume that all data's season
is as yet not defined:
dat$season <- NA
Loop around each of the seasons' rows:
for (i in seq_len(nrow(seasons))) {
dat$season <- ifelse(is.na(dat$season) &
dat$date >= seasons$Opening.Date[i] &
dat$date < seasons$Closing.Date[i],
seasons$Season[i], dat$season)
}
dat
# id date sex season
# 1 160520 2016-11-22 1 2016
# 2 1029735 2016-11-12 1 2016
# 3 1885200 2016-11-05 1 2016
# 4 2058366 2015-09-26 2 2015
# 5 2058367 2015-09-26 1 2015
# 6 2058368 2015-09-26 1 2015