Search code examples
rdataframesubsetdifftime

Subsetting rows based on dates and criteria across two data frames


I have one data frame outlining pollution levels continuously measured from two sites.

Dates <- as.data.frame(seq(as.Date("2015/01/01"), as.Date("2017/01/01"),"day"))
Pollution_Site.A <- as.data.frame(c(seq(from = 1, to = 366, by = 1),
                       (seq(from = 366, to = 1, by = -1))))
Pollution_Site.B <- as.data.frame(c(seq(from = 0, to = 365, by = 1),
                                (seq(from = 365, to = 0, by = -1))))
df1 <- cbind(Dates,Pollution_Site.A,Pollution_Site.B)
colnames(df1) <- c("Dates","Site.A","Site.B")

I have a separate data frame highlighting when surveyors (each site has one unique surveyor) visited each site.

Site<- c("Site.A","Site.A","Site.B","Site.B")
Survey_Dates <- as.data.frame(as.POSIXct(c("2014/08/17","2016/08/01",
                                      "2015/02/01","2016/10/31")))
df2 <- as.data.frame(cbind(Site,Survey_Dates))
colnames(df2) <- c("Site","Survey_Dates")

What I want to do is (i) define a high pollution event (although perhaps some form of 'apply' function would be better to do this iteratively across multiple sites)?

 High_limit_Site.A <- 1.5*median(df1$Site.A)
 High_limit_Site.B <- 1.5*median(df1$Site.B)

The I want to (ii) subset the second data frame to show which surveyors have visited the site before and after a high pollution event within 1 year (providing there is pollution data as well). I presume something along the 'difftime' function will work here, but am not sure how I would apply this.

Finally, I would like (iii) the subsetted data frame to highlight whether the surveyor was out before or after the pollution event.

So in the example above, the desired output should only contain Site B. This is because Site A's first survey date precedes the first pollution measurement AND was over a year before the high pollution event. Thank you in advance for any help on this.


Solution

  • Just to build on the answer @dmi3kno displayed above, I can then subset sites which contain both a "before" and "after" sign for each site.

    Output_df <- df1 %>% gather(key=Site, value=Pollution, -Dates) %>% 
    group_by(Site) %>% 
    mutate(HighLimit=as.numeric(Pollution>1.5*median(Pollution))) %>% 
    filter(HighLimit==1) %>% 
    left_join(df2, by=c("Site")) %>% 
    mutate(Time_Lag = as.numeric(as.Date(Survey_Dates)-as.Date(Dates)),
         Been_Before = ifelse(Time_Lag>0, "after", "before")) %>% 
    filter(abs(Time_Lag)<365) %>% 
    group_by(Site, Survey_Dates, Been_Before) %>% 
    summarise(Event_date_min=min(Dates), 
            Event_date_max=max(Dates))
    

    Then using dplyr again:

    Final_df <- Output_df %>%
    group_by(Site) %>%
    filter(all(c("before", "after") %in% Been_Before))