Search code examples
rdatetime-seriestidyverselubridate

matching yearly time points to preceding 365 days of data in R


I am trying to merge two datasets. The survey dataset consists of biodiversity surveys from different regions conducted every 1-5 years in a certain month (the month is constant within, but not between, regions). The temperature dataset consists of daily temperature readings in each survey region.

For multiple surveys that have different start months and temporal extents, I want to pair each survey*year combination with the twelve months of temperature data preceding it. In other words, I want to pair a May 1983 survey with the 12 months (or 365 days -- I don't care which) of daily temperature records preceding it, ending April 30, 1983. Meanwhile, another survey elsewhere conducted in August 1983 needs to be paired with the 365 days of temperature data ending July 31, 1983.

There are (at least) two ways to do this -- one would be joining the survey data to the (longer) temperature data and then somehow subsetting or identifying which dates fall in the 12 months preceding the survey-date. Another is to start with the survey data and try to pair the temperature data to each row with a matrix-column -- I tried doing this with time-series tools from tsibble and tsModel but couldn't get it to "lag" the right values when grouped by region.

I was able to create an identifier to join the datasets such that each date in the temperature data is matched with the subsequent survey in time. However, not all of those are within 365 days (e.g., in the dataset created below, the date 1983-06-03 is matched with the ref_year aleutian_islands-5-1986 because the survey only happens every 3-5 years).

Here are some examples of the behavior I want for a single region (from the example dataset below), although I'm open to solutions that achieve the same thing but don't look exactly like this:

For this row, the value in the new column that I want to generate (ref_match) should be NA; the date is more than 365 days before ref_year.

  region           date        year month month_year ref_year                temperature     
  <chr>            <date>     <dbl> <dbl> <chr>      <chr>                         <dbl>
1 aleutian_islands 1982-06-09  1982     6 6-1982     aleutian_islands-5-1983           0   

For this row, ref_match should be aleutian_islands-5-2014 since the date is within 12 months of ref_year.

  region           date        year month month_year ref_year                temperature
  <chr>            <date>     <dbl> <dbl> <chr>      <chr>                         <dbl>
1 aleutian_islands 2013-07-22  2013     7 7-2013     aleutian_islands-5-2014       0.998

The following script will generate a dataset temp_dat with columns like those in the snippets above from which I hope to generate the ref_match column.

# load packages
library(tidyverse)
library(lubridate)
set.seed=10

# make survey dfs
ai_dat <- data.frame("year" = c(1983, 1986, 1991, 1994, 1997), "region" = "aleutian_islands", "startmonth" = 5)
ebs_dat <- data.frame("year" = seq(1983, 1999, 1), "region" = "eastern_bering_sea", "startmonth" = 5)

# join and create what will become ref_year column
surv_dat <- rbind(ai_dat, ebs_dat) %>% 
  mutate(month_year = paste0(startmonth,"-",year)) %>%
  select(region, month_year) %>%
  distinct() %>%
  mutate(region_month_year = paste0(region,"-",month_year))

# expand out to all possible month*year combinations for joining with temperature
surv_dat_exploded <- expand.grid(month=seq(1, 12, 1), year=seq(1982, 2000, 1), region=c('aleutian_islands','eastern_bering_sea')) %>% # get a factorial combo of every possible month*year; have to start in 1982 even though we can't use surveys before 1983 because we need to match to temperature data from 1982
  mutate(region_month_year = paste0(region,"-",month,"-",year)) %>% # create unique identifier
  mutate(ref_year = ifelse(region_month_year %in% surv_dat$region_month_year, region_month_year, NA),
         month_year = paste0(month,"-",year)) %>% 
  select(region, month_year, ref_year) %>% 
  distinct() %>% 
  group_by(region) %>% 
  fill(ref_year, .direction="up") %>%  # fill in each region with the survey to which env data from each month*year should correspond
  ungroup() 

# make temperature dataset and join in survey ref_year column 
temp_dat <- data.frame(expand.grid(date=seq(ymd("1982-01-01"), ymd("1999-12-31"), "days"), region=c('aleutian_islands','eastern_bering_sea'))) %>% 
  mutate(temperature = rnorm(nrow(.), 10, 5),  # fill in with fake data
         year = year(date),
         month = month(date),
         month_year = paste0(month,"-",year)) %>% 
  left_join(surv_dat_exploded, by=c('region','month_year')) %>% 
  filter(!is.na(ref_year))# get rid of dates that are after any ref_year

Solution

  • Try this solution.

    I basically used your reference column to generate a ref_date and estimate the difference in days between the observation and reference. Then, I used a simple ifelse to test if the dates fall within the 365 days range and then copy them to the temp_valid column.

    
    # load packages
    library(tidyverse)
    library(lubridate)
    set.seed=10
    
    # make survey dfs
    ai_dat <- data.frame("year" = c(1983, 1986, 1991, 1994, 1997), "region" = "aleutian_islands", "startmonth" = 5)
    ebs_dat <- data.frame("year" = seq(1983, 1999, 1), "region" = "eastern_bering_sea", "startmonth" = 5)
    
    # join and create what will become ref_year column
    surv_dat <-
      rbind(ai_dat, ebs_dat) %>% 
      mutate(year_month = paste0(year,"-",startmonth),
             region_year_month = paste0(region,"-",year,"-",startmonth)) 
    
    
    # expand out to all possible month*year combinations for joining with temperature
    surv_dat_exploded <-
      expand.grid(month=seq(01, 12, 1), year=seq(1982, 2000, 1), region=c('aleutian_islands','eastern_bering_sea')) %>% # get a factorial combo of every possible month*year; have to start in 1982 even though we can't use surveys before 1983 because we need to match to temperature data from 1982
      mutate(year_month = paste0(year,"-",month)) %>%
      mutate(region_year_month = paste0(region,"-",year,"-",month)) %>% 
      mutate(ref_year = ifelse(region_year_month %in% surv_dat$region_year_month, region_year_month,NA)) %>%
      group_by(region) %>% 
      fill(ref_year, .direction="up") %>%  # fill in each region with the survey to which env data from each month*year should correspond
      ungroup() 
    
    # make temperature dataset and join in survey ref_year column 
    temp_dat <- data.frame(expand.grid(date=seq(ymd("1982-01-01"), ymd("1999-12-31"), "days"), region=c('aleutian_islands','eastern_bering_sea'))) %>% 
      mutate(temperature = rnorm(nrow(.), 10, 5),  # fill in with fake data
             year = year(date),
             month = month(date),
             year_month = paste0(year,"-",month))
    
    final_df <- 
      left_join(temp_dat, surv_dat_exploded, by=c('region','year_month')) %>% 
        #split ref_column in ref_year and ref_region
      separate(ref_year, c("ref_region","ref_year"), "-", extra="merge") %>%
        #convert ref_year into date
        mutate_at("ref_year", as.Date, format= "%Y-%M") %>% 
        #round it down to be in the first day of the month (not needed if the day matters)
        mutate_at("ref_year", floor_date, "month" ) %>% 
        #difference between observed and the reference
        mutate(diff_days = date - ref_year) %>% 
        # ifelse statement for capturing values of interest
        mutate(temp_valid = ifelse(between(diff_days, -365, 0),temperature,NA))