I have a set of data with precipitation totals (Precip) for every day. I collected data at multiple pools. I need to calculate the number of days it has been since the last time it raied >12mm. If it rained over 12mm that day then it get a '0'.
## creates example dataframe
Pool <- c("A","A","A","A","A","A","A","A","A","A",
"B","B","B","B","B","B","B","B","B","B")
DATE <- as.Date(c("2005-01-01","2005-01-02","2005-01-03","2005-01-04","2005-01-05",
"2005-01-06","2005-01-07","2005-01-08","2005-01-09","2005-01-10",
"2005-01-01","2005-01-02","2005-01-03","2005-01-04","2005-01-05",
"2005-01-06","2005-01-07","2005-01-08","2005-01-09","2005-01-10"))
Precip <- c(0,0,3,18,4,3,13,8,3,0,13,0,3,13,0,3,10,8,13,0))
df <- data.frame(Pool, DATE, Precip)
I need the following dataframe:
Pool DATE Precip Days_since12
1 A 2005-01-01 0 NA
2 A 2005-01-02 0 NA
3 A 2005-01-03 3 NA
4 A 2005-01-04 18 NA
5 A 2005-01-05 4 1
6 A 2005-01-06 3 2
7 A 2005-01-07 13 0
8 A 2005-01-08 8 1
9 A 2005-01-09 3 2
10 A 2005-01-10 0 3
11 B 2005-01-01 13 0
12 B 2005-01-02 0 1
13 B 2005-01-03 3 2
14 B 2005-01-04 13 0
15 B 2005-01-05 0 1
16 B 2005-01-06 3 2
17 B 2005-01-07 10 3
18 B 2005-01-08 8 4
19 B 2005-01-09 13 0
20 B 2005-01-10 0 1
I can easily add a column to indicate if the precip was over 12 or not:
df2 <- df %>%
group_by(Pool) %>%
mutate(pre12=ifelse(Precip>12,1,0))
But I am then not sure how to calculate the number of days between the DATE and the previous date when pre12==1
Many of the answers here ignore the value in the date column, instead revert to "counting rows," and therefore require very clean data--which your example provides-- one row per day, no missing values, everything in order.
Using the dates directly provides a more robust approach. Here's one way:
You've got a good start with if_else()
, but we can go further. Instead of simply capturing a true/false there, let's record the actual date for any day with precipitation > 12. This is the first step in determining the most recent day with precipitation > 12.
library(dplyr)
df %>%
group_by(Pool) %>%
mutate(latest_12 = if_else(Precip > 12, DATE, NA) %>% as.Date())
#> # A tibble: 20 × 4
#> # Groups: Pool [2]
#> Pool DATE Precip latest_12
#> <chr> <date> <dbl> <date>
#> 1 A 2005-01-01 0 NA
#> 2 A 2005-01-02 0 NA
#> 3 A 2005-01-03 3 NA
#> 4 A 2005-01-04 18 2005-01-04
#> 5 A 2005-01-05 4 NA
#> 6 A 2005-01-06 3 NA
#> 7 A 2005-01-07 13 2005-01-07
#> 8 A 2005-01-08 8 NA
#> 9 A 2005-01-09 3 NA
#> 10 A 2005-01-10 0 NA
#> 11 B 2005-01-01 13 2005-01-01
#> 12 B 2005-01-02 0 NA
#> 13 B 2005-01-03 3 NA
#> 14 B 2005-01-04 13 2005-01-04
#> 15 B 2005-01-05 0 NA
#> 16 B 2005-01-06 3 NA
#> 17 B 2005-01-07 10 NA
#> 18 B 2005-01-08 8 NA
#> 19 B 2005-01-09 13 2005-01-09
#> 20 B 2005-01-10 0 NA
Then we can use tidyr::fill()
(which is smart enough to stay within groups defined by group_by()) to fill down the last non-missing value into all NAs, to get a column of the latest date with precip > 12 for a given pool. The data needs to be in increasing order here for the fill to do what we want, so don't forget an arrange
to cover your backside!
library(tidyr)
df %>%
group_by(Pool) %>%
mutate(latest_12 = if_else(Precip > 12, DATE, NA) %>% as.Date()) %>%
arrange(Pool, DATE) %>%
fill(latest_12, .direction = "down")
#> # A tibble: 20 × 4
#> # Groups: Pool [2]
#> Pool DATE Precip latest_12
#> <chr> <date> <dbl> <date>
#> 1 A 2005-01-01 0 NA
#> 2 A 2005-01-02 0 NA
#> 3 A 2005-01-03 3 NA
#> 4 A 2005-01-04 18 2005-01-04
#> 5 A 2005-01-05 4 2005-01-04
#> 6 A 2005-01-06 3 2005-01-04
#> 7 A 2005-01-07 13 2005-01-07
#> 8 A 2005-01-08 8 2005-01-07
#> 9 A 2005-01-09 3 2005-01-07
#> 10 A 2005-01-10 0 2005-01-07
#> 11 B 2005-01-01 13 2005-01-01
#> 12 B 2005-01-02 0 2005-01-01
#> 13 B 2005-01-03 3 2005-01-01
#> 14 B 2005-01-04 13 2005-01-04
#> 15 B 2005-01-05 0 2005-01-04
#> 16 B 2005-01-06 3 2005-01-04
#> 17 B 2005-01-07 10 2005-01-04
#> 18 B 2005-01-08 8 2005-01-04
#> 19 B 2005-01-09 13 2005-01-09
#> 20 B 2005-01-10 0 2005-01-09
Then we can use actual date math on these two columns to compute the elapsed time since the last rain.
df %>%
group_by(Pool) %>%
mutate(latest_12 = if_else(Precip > 12, DATE, NA) %>% as.Date()) %>%
arrange(Pool, DATE) %>%
fill(latest_12, .direction = "down") %>%
mutate(days_since_12 = DATE - latest_12)
#> # A tibble: 20 × 5
#> # Groups: Pool [2]
#> Pool DATE Precip latest_12 days_since_12
#> <chr> <date> <dbl> <date> <drtn>
#> 1 A 2005-01-01 0 NA NA days
#> 2 A 2005-01-02 0 NA NA days
#> 3 A 2005-01-03 3 NA NA days
#> 4 A 2005-01-04 18 2005-01-04 0 days
#> 5 A 2005-01-05 4 2005-01-04 1 days
#> 6 A 2005-01-06 3 2005-01-04 2 days
#> 7 A 2005-01-07 13 2005-01-07 0 days
#> 8 A 2005-01-08 8 2005-01-07 1 days
#> 9 A 2005-01-09 3 2005-01-07 2 days
#> 10 A 2005-01-10 0 2005-01-07 3 days
#> 11 B 2005-01-01 13 2005-01-01 0 days
#> 12 B 2005-01-02 0 2005-01-01 1 days
#> 13 B 2005-01-03 3 2005-01-01 2 days
#> 14 B 2005-01-04 13 2005-01-04 0 days
#> 15 B 2005-01-05 0 2005-01-04 1 days
#> 16 B 2005-01-06 3 2005-01-04 2 days
#> 17 B 2005-01-07 10 2005-01-04 3 days
#> 18 B 2005-01-08 8 2005-01-04 4 days
#> 19 B 2005-01-09 13 2005-01-09 0 days
#> 20 B 2005-01-10 0 2005-01-09 1 days
Because we're doing date math, and we've arranged our data, this approach is robust to missing and out-of-order data.
# change the date of the 5th row from 2005-01-05 to 2005-01-15,
# creating a missing value and an out-of-order value
df$DATE[5] <- as.Date("2005-01-15")
df %>%
group_by(Pool) %>%
mutate(latest_12 = if_else(Precip > 12, DATE, NA) %>% as.Date()) %>%
arrange(Pool, DATE) %>%
fill(latest_12, .direction = "down") %>%
mutate(days_since_12 = DATE - latest_12)
#> # A tibble: 20 × 5
#> # Groups: Pool [2]
#> Pool DATE Precip latest_12 days_since_12
#> <chr> <date> <dbl> <date> <drtn>
#> 1 A 2005-01-01 0 NA NA days
#> 2 A 2005-01-02 0 NA NA days
#> 3 A 2005-01-03 3 NA NA days
#> 4 A 2005-01-04 18 2005-01-04 0 days
#> 5 A 2005-01-06 3 2005-01-04 2 days
#> 6 A 2005-01-07 13 2005-01-07 0 days
#> 7 A 2005-01-08 8 2005-01-07 1 days
#> 8 A 2005-01-09 3 2005-01-07 2 days
#> 9 A 2005-01-10 0 2005-01-07 3 days
#> 10 A 2005-01-15 4 2005-01-07 8 days
#> 11 B 2005-01-01 13 2005-01-01 0 days
#> 12 B 2005-01-02 0 2005-01-01 1 days
#> 13 B 2005-01-03 3 2005-01-01 2 days
#> 14 B 2005-01-04 13 2005-01-04 0 days
#> 15 B 2005-01-05 0 2005-01-04 1 days
#> 16 B 2005-01-06 3 2005-01-04 2 days
#> 17 B 2005-01-07 10 2005-01-04 3 days
#> 18 B 2005-01-08 8 2005-01-04 4 days
#> 19 B 2005-01-09 13 2005-01-09 0 days
#> 20 B 2005-01-10 0 2005-01-09 1 days