For the sake of a hypothetical: imagine that I'm an avid sportsman. I have a dataset that logs every time I go boating/skiing/whatever. I want to calculate the number of days that have passed since the last hypothetical accident that occurred in each given locale.
Here is a small reproducible set of my data:
mydata <- data.frame(state = c(rep("Vermont", 5), rep("New Hampshire", 5)),
date = c("2016-01-01", "2016-01-03", "2016-01-04", "2016-01-04", "2016-02-01",
"2016-01-03", "2016-01-15", "2016-01-16", "2016-02-01", "2016-02-03"),
accident = c(1, 0, 0, 1, 1,
0, 1, 1, 0, 1))
Here's what it looks like. Notice that the dates are "ragged" - I sometimes ski/boat two days in a row, and other times I take a week-long break. Also notice that I'm doing these sports activities in multiple states, and I want to group_by
the state, resulting in an NA
value for the first time I ski/boat in that area.
state date accident
Vermont 2016-01-01 1
Vermont 2016-01-02 0
Vermont 2016-01-03 0
Vermont 2016-01-04 1
Vermont 2016-02-01 1
New Hampshire 2016-01-03 0
New Hampshire 2016-01-15 1
New Hampshire 2016-01-16 1
New Hampshire 2016-02-01 0
New Hampshire 2016-02-03 1
I want to produce this:
state date accident numdays
Vermont 2016-01-01 1 NA
Vermont 2016-01-02 0 1
Vermont 2016-01-03 0 2
Vermont 2016-01-04 1 3
Vermont 2016-02-01 1 28
New Hampshire 2016-01-03 0 NA
New Hampshire 2016-01-15 1 NA
New Hampshire 2016-01-16 1 1
New Hampshire 2016-02-01 0 16
New Hampshire 2016-02-03 1 18
*Edited to correct typo in data
Here is one option :
Create a new column which has dates when an accident occurs. Use tidyr::fill
to fill the days when accident does not occur with the previous value. For each state
calculate number of days passed since the last accident occur.
library(dplyr)
mydata %>%
mutate(date = as.Date(date),
numdays = replace(date, accident == 0, NA)) %>%
group_by(state) %>%
mutate(numdays = lag(numdays)) %>%
tidyr::fill(numdays) %>%
mutate(numdays = as.integer(date - numdays))
# state date accident numdays
# <chr> <date> <dbl> <int>
# 1 Vermont 2016-01-01 1 NA
# 2 Vermont 2016-01-02 0 1
# 3 Vermont 2016-01-03 0 2
# 4 Vermont 2016-01-04 1 3
# 5 Vermont 2016-02-01 1 28
# 6 New Hampshire 2016-01-03 0 NA
# 7 New Hampshire 2016-01-15 1 NA
# 8 New Hampshire 2016-01-16 1 1
# 9 New Hampshire 2016-02-01 0 16
#10 New Hampshire 2016-02-03 1 18
data
There was probably a typo in few of date
entries which I have corrected below.
mydata <- data.frame(state = c(rep("Vermont", 5), rep("New Hampshire", 5)),
date = c("2016-01-01", "2016-01-02", "2016-01-03", "2016-01-04", "2016-02-01",
"2016-01-03", "2016-01-15", "2016-01-16", "2016-02-01", "2016-02-03"),
accident = c(1, 0, 0, 1, 1,
0, 1, 1, 0, 1))