Search code examples
rdplyrlubridate

I need to count the number of days between a date and the last day that precipitation was greater than 12


I have a set of data with precipitation totals (Precip) for every day. I collected data at multiple pools. I need to calculate the number of days it has been since the last time it raied >12mm. If it rained over 12mm that day then it get a '0'.

## creates example dataframe
Pool <- c("A","A","A","A","A","A","A","A","A","A",
        "B","B","B","B","B","B","B","B","B","B")
DATE <- as.Date(c("2005-01-01","2005-01-02","2005-01-03","2005-01-04","2005-01-05",
                        "2005-01-06","2005-01-07","2005-01-08","2005-01-09","2005-01-10",
                        "2005-01-01","2005-01-02","2005-01-03","2005-01-04","2005-01-05",
                        "2005-01-06","2005-01-07","2005-01-08","2005-01-09","2005-01-10"))
Precip <- c(0,0,3,18,4,3,13,8,3,0,13,0,3,13,0,3,10,8,13,0))

df <- data.frame(Pool, DATE, Precip)

I need the following dataframe:

   Pool       DATE Precip  Days_since12
1     A 2005-01-01      0     NA
2     A 2005-01-02      0     NA
3     A 2005-01-03      3     NA
4     A 2005-01-04     18     NA
5     A 2005-01-05      4     1
6     A 2005-01-06      3     2
7     A 2005-01-07     13     0
8     A 2005-01-08      8     1
9     A 2005-01-09      3     2
10    A 2005-01-10      0     3
11    B 2005-01-01     13     0
12    B 2005-01-02      0     1
13    B 2005-01-03      3     2
14    B 2005-01-04     13     0
15    B 2005-01-05      0     1
16    B 2005-01-06      3     2
17    B 2005-01-07     10     3
18    B 2005-01-08      8     4
19    B 2005-01-09     13     0
20    B 2005-01-10      0     1
 

I can easily add a column to indicate if the precip was over 12 or not:

df2 <- df %>%
  group_by(Pool) %>%
  mutate(pre12=ifelse(Precip>12,1,0))

But I am then not sure how to calculate the number of days between the DATE and the previous date when pre12==1


Solution

  • Many of the answers here ignore the value in the date column, instead revert to "counting rows," and therefore require very clean data--which your example provides-- one row per day, no missing values, everything in order.

    Using the dates directly provides a more robust approach. Here's one way:

    You've got a good start with if_else(), but we can go further. Instead of simply capturing a true/false there, let's record the actual date for any day with precipitation > 12. This is the first step in determining the most recent day with precipitation > 12.

    library(dplyr)
    
    df %>%
      group_by(Pool) %>%
      mutate(latest_12 = if_else(Precip > 12, DATE, NA) %>% as.Date())
    #> # A tibble: 20 × 4
    #> # Groups:   Pool [2]
    #>    Pool  DATE       Precip latest_12 
    #>    <chr> <date>      <dbl> <date>    
    #>  1 A     2005-01-01      0 NA        
    #>  2 A     2005-01-02      0 NA        
    #>  3 A     2005-01-03      3 NA        
    #>  4 A     2005-01-04     18 2005-01-04
    #>  5 A     2005-01-05      4 NA        
    #>  6 A     2005-01-06      3 NA        
    #>  7 A     2005-01-07     13 2005-01-07
    #>  8 A     2005-01-08      8 NA        
    #>  9 A     2005-01-09      3 NA        
    #> 10 A     2005-01-10      0 NA        
    #> 11 B     2005-01-01     13 2005-01-01
    #> 12 B     2005-01-02      0 NA        
    #> 13 B     2005-01-03      3 NA        
    #> 14 B     2005-01-04     13 2005-01-04
    #> 15 B     2005-01-05      0 NA        
    #> 16 B     2005-01-06      3 NA        
    #> 17 B     2005-01-07     10 NA        
    #> 18 B     2005-01-08      8 NA        
    #> 19 B     2005-01-09     13 2005-01-09
    #> 20 B     2005-01-10      0 NA
    

    Then we can use tidyr::fill() (which is smart enough to stay within groups defined by group_by()) to fill down the last non-missing value into all NAs, to get a column of the latest date with precip > 12 for a given pool. The data needs to be in increasing order here for the fill to do what we want, so don't forget an arrange to cover your backside!

    library(tidyr)
    df %>%
      group_by(Pool) %>%
      mutate(latest_12 = if_else(Precip > 12, DATE, NA) %>% as.Date()) %>%
      arrange(Pool, DATE) %>%
      fill(latest_12, .direction = "down")
    #> # A tibble: 20 × 4
    #> # Groups:   Pool [2]
    #>    Pool  DATE       Precip latest_12 
    #>    <chr> <date>      <dbl> <date>    
    #>  1 A     2005-01-01      0 NA        
    #>  2 A     2005-01-02      0 NA        
    #>  3 A     2005-01-03      3 NA        
    #>  4 A     2005-01-04     18 2005-01-04
    #>  5 A     2005-01-05      4 2005-01-04
    #>  6 A     2005-01-06      3 2005-01-04
    #>  7 A     2005-01-07     13 2005-01-07
    #>  8 A     2005-01-08      8 2005-01-07
    #>  9 A     2005-01-09      3 2005-01-07
    #> 10 A     2005-01-10      0 2005-01-07
    #> 11 B     2005-01-01     13 2005-01-01
    #> 12 B     2005-01-02      0 2005-01-01
    #> 13 B     2005-01-03      3 2005-01-01
    #> 14 B     2005-01-04     13 2005-01-04
    #> 15 B     2005-01-05      0 2005-01-04
    #> 16 B     2005-01-06      3 2005-01-04
    #> 17 B     2005-01-07     10 2005-01-04
    #> 18 B     2005-01-08      8 2005-01-04
    #> 19 B     2005-01-09     13 2005-01-09
    #> 20 B     2005-01-10      0 2005-01-09
    

    Then we can use actual date math on these two columns to compute the elapsed time since the last rain.

    df %>%
      group_by(Pool) %>%
      mutate(latest_12 = if_else(Precip > 12, DATE, NA) %>% as.Date()) %>%
      arrange(Pool, DATE) %>%
      fill(latest_12, .direction = "down") %>%
      mutate(days_since_12 = DATE - latest_12)
    #> # A tibble: 20 × 5
    #> # Groups:   Pool [2]
    #>    Pool  DATE       Precip latest_12  days_since_12
    #>    <chr> <date>      <dbl> <date>     <drtn>       
    #>  1 A     2005-01-01      0 NA         NA days      
    #>  2 A     2005-01-02      0 NA         NA days      
    #>  3 A     2005-01-03      3 NA         NA days      
    #>  4 A     2005-01-04     18 2005-01-04  0 days      
    #>  5 A     2005-01-05      4 2005-01-04  1 days      
    #>  6 A     2005-01-06      3 2005-01-04  2 days      
    #>  7 A     2005-01-07     13 2005-01-07  0 days      
    #>  8 A     2005-01-08      8 2005-01-07  1 days      
    #>  9 A     2005-01-09      3 2005-01-07  2 days      
    #> 10 A     2005-01-10      0 2005-01-07  3 days      
    #> 11 B     2005-01-01     13 2005-01-01  0 days      
    #> 12 B     2005-01-02      0 2005-01-01  1 days      
    #> 13 B     2005-01-03      3 2005-01-01  2 days      
    #> 14 B     2005-01-04     13 2005-01-04  0 days      
    #> 15 B     2005-01-05      0 2005-01-04  1 days      
    #> 16 B     2005-01-06      3 2005-01-04  2 days      
    #> 17 B     2005-01-07     10 2005-01-04  3 days      
    #> 18 B     2005-01-08      8 2005-01-04  4 days      
    #> 19 B     2005-01-09     13 2005-01-09  0 days      
    #> 20 B     2005-01-10      0 2005-01-09  1 days
    

    Because we're doing date math, and we've arranged our data, this approach is robust to missing and out-of-order data.

    # change the date of the 5th row from 2005-01-05 to 2005-01-15, 
    # creating a missing value and an out-of-order value
    df$DATE[5] <- as.Date("2005-01-15") 
    
    df %>%
      group_by(Pool) %>%
      mutate(latest_12 = if_else(Precip > 12, DATE, NA) %>% as.Date()) %>%
      arrange(Pool, DATE) %>%
      fill(latest_12, .direction = "down") %>%
      mutate(days_since_12 = DATE - latest_12)
    #> # A tibble: 20 × 5
    #> # Groups:   Pool [2]
    #>    Pool  DATE       Precip latest_12  days_since_12
    #>    <chr> <date>      <dbl> <date>     <drtn>       
    #>  1 A     2005-01-01      0 NA         NA days      
    #>  2 A     2005-01-02      0 NA         NA days      
    #>  3 A     2005-01-03      3 NA         NA days      
    #>  4 A     2005-01-04     18 2005-01-04  0 days      
    #>  5 A     2005-01-06      3 2005-01-04  2 days      
    #>  6 A     2005-01-07     13 2005-01-07  0 days      
    #>  7 A     2005-01-08      8 2005-01-07  1 days      
    #>  8 A     2005-01-09      3 2005-01-07  2 days      
    #>  9 A     2005-01-10      0 2005-01-07  3 days      
    #> 10 A     2005-01-15      4 2005-01-07  8 days      
    #> 11 B     2005-01-01     13 2005-01-01  0 days      
    #> 12 B     2005-01-02      0 2005-01-01  1 days      
    #> 13 B     2005-01-03      3 2005-01-01  2 days      
    #> 14 B     2005-01-04     13 2005-01-04  0 days      
    #> 15 B     2005-01-05      0 2005-01-04  1 days      
    #> 16 B     2005-01-06      3 2005-01-04  2 days      
    #> 17 B     2005-01-07     10 2005-01-04  3 days      
    #> 18 B     2005-01-08      8 2005-01-04  4 days      
    #> 19 B     2005-01-09     13 2005-01-09  0 days      
    #> 20 B     2005-01-10      0 2005-01-09  1 days