
Calculating number of days using criterion


Overview

For the sake of a hypothetical: imagine that I'm an avid sportsman. I have a dataset that logs every time I go boating/skiing/whatever. I want to calculate the number of days that have passed since the last hypothetical accident that occurred in each given locale.

My data

Here is a small reproducible set of my data:

mydata <- data.frame(state = c(rep("Vermont", 5), rep("New Hampshire", 5)),
                     date = c("2016-01-01", "2016-01-03", "2016-01-04", "2016-01-04", "2016-02-01",
                              "2016-01-03", "2016-01-15", "2016-01-16", "2016-02-01", "2016-02-03"),
                     accident = c(1, 0, 0, 1, 1,
                                  0, 1, 1, 0, 1))

Here's what it looks like. Notice that the dates are "ragged": I sometimes ski/boat two days in a row, and other times I take a week-long break. Also notice that I do these activities in multiple states, so I want to group_by the state. The result should be NA for every row in a state that has no earlier accident to count from.

          state       date accident
        Vermont 2016-01-01        1
        Vermont 2016-01-02        0
        Vermont 2016-01-03        0
        Vermont 2016-01-04        1
        Vermont 2016-02-01        1
  New Hampshire 2016-01-03        0
  New Hampshire 2016-01-15        1
  New Hampshire 2016-01-16        1
  New Hampshire 2016-02-01        0
  New Hampshire 2016-02-03        1

I want to produce this:

          state       date accident   numdays
        Vermont 2016-01-01        1        NA
        Vermont 2016-01-02        0         1
        Vermont 2016-01-03        0         2
        Vermont 2016-01-04        1         3
        Vermont 2016-02-01        1        28
  New Hampshire 2016-01-03        0        NA
  New Hampshire 2016-01-15        1        NA
  New Hampshire 2016-01-16        1         1
  New Hampshire 2016-02-01        0        16
  New Hampshire 2016-02-03        1        18

*Edited to correct typo in data


Solution

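  • Here is one option:

    Create a new column that holds the date on rows where an accident occurs and NA elsewhere. Within each state, lag that column so that a row never counts its own accident, then use tidyr::fill to carry the most recent accident date forward over the NA rows. Finally, subtract that date from the current date to get the number of days since the last accident.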

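    library(dplyr)
    
    mydata %>%
      # Parse the dates; keep the date only on accident rows, NA elsewhere
      mutate(date = as.Date(date),
             numdays = replace(date, accident == 0, NA)) %>%
      group_by(state) %>%
      # Shift down one row so a row never counts its own accident
      mutate(numdays = lag(numdays)) %>%
      # Carry the most recent accident date forward within each state
      tidyr::fill(numdays) %>%
      # Days between the current row and that accident date
      mutate(numdays = as.integer(date - numdays))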
    
    #   state         date       accident numdays
    #   <chr>         <date>        <dbl>   <int>
    # 1 Vermont       2016-01-01        1      NA
    # 2 Vermont       2016-01-02        0       1
    # 3 Vermont       2016-01-03        0       2
    # 4 Vermont       2016-01-04        1       3
    # 5 Vermont       2016-02-01        1      28
    # 6 New Hampshire 2016-01-03        0      NA
    # 7 New Hampshire 2016-01-15        1      NA
    # 8 New Hampshire 2016-01-16        1       1
    # 9 New Hampshire 2016-02-01        0      16
    #10 New Hampshire 2016-02-03        1      18
    

    data

    The date entries in the question's data.frame code do not match the printed table, probably because of a typo, so I have corrected them below.

    mydata <- data.frame(state = c(rep("Vermont", 5), rep("New Hampshire", 5)),
                         date = c("2016-01-01", "2016-01-02", "2016-01-03", "2016-01-04", "2016-02-01",
                                  "2016-01-03", "2016-01-15", "2016-01-16", "2016-02-01", "2016-02-03"),
                         accident = c(1, 0, 0, 1, 1,
                                      0, 1, 1, 0, 1))
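
For readers who want to see the lag-then-fill idea without dplyr or tidyr, here is a minimal base R sketch of the same logic. It is not part of the original answer: the helper name days_since_last is hypothetical, and it assumes that the rows of each state are already sorted by date and that each state has at least one row. It uses the corrected data and replaces the fill step with a running maximum over row indices.

```r
# Corrected data from the answer, with dates parsed up front
mydata <- data.frame(
  state = c(rep("Vermont", 5), rep("New Hampshire", 5)),
  date = as.Date(c("2016-01-01", "2016-01-02", "2016-01-03", "2016-01-04", "2016-02-01",
                   "2016-01-03", "2016-01-15", "2016-01-16", "2016-02-01", "2016-02-03")),
  accident = c(1, 0, 0, 1, 1,
               0, 1, 1, 0, 1)
)

# Hypothetical helper: days since the most recent *earlier* accident
# within a single state. Assumes rows are sorted by date.
days_since_last <- function(date, accident) {
  n <- length(date)
  # Row index on accident rows, 0 where no accident happened
  idx <- ifelse(accident == 1, seq_len(n), 0L)
  # Shift by one row (the lag) so a row never counts its own accident,
  # then carry the latest accident index forward with a running maximum
  prev <- cummax(c(0L, idx[-n]))
  # Index 0 means "no earlier accident in this state"
  prev[prev == 0] <- NA
  as.integer(date - date[prev])
}

# Apply the helper state by state, keeping the original row order
mydata$numdays <- NA_integer_
for (s in unique(mydata$state)) {
  rows <- mydata$state == s
  mydata$numdays[rows] <- days_since_last(mydata$date[rows], mydata$accident[rows])
}

print(mydata)
```

The running maximum works here because row indices only increase within a state, so the largest accident index seen so far is always the most recent one. If the data is not sorted by date within each state, it would need to be ordered first.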