Search code examples
rdatedplyr

Calculate days from start in R


I'm looking for a way, to calculate the number of days a participant (id) spent in a study.

An exemplary data file looks like this:

data <- data.frame(date = as.Date(c("2020-11-29", "2020-11-30", "2020-12-02", 
                                    "2020-12-04", "2020-12-05", "2020-12-08",
                                    "2020-11-22", "2020-11-21", "2020-11-24", 
                                    "2020-11-25", "2020-11-30", "2020-11-29",
                                    "2021-01-29", "2021-01-20", "2021-01-30", 
                                    "2021-02-01", "2021-02-04", "2021-02-04")),
                   id = rep(1:3, each = 6))

data <- dplyr::arrange(data, id, date)

data

         date id
1  2020-11-29  1
2  2020-11-30  1
3  2020-12-02  1
4  2020-12-04  1
5  2020-12-05  1
6  2020-12-08  1
7  2020-11-21  2
8  2020-11-22  2
9  2020-11-24  2
10 2020-11-25  2
11 2020-11-29  2
12 2020-11-30  2
13 2021-01-20  3
14 2021-01-29  3
15 2021-01-30  3
16 2021-02-01  3
17 2021-02-04  3
18 2021-02-04  3

What i'd like to have, is new column days_from_start that will take the 1st day for every id and set it to 0. Then it will compute number of days for every other row within each id. Something like this:

data$days_from_start <- c(0, 1, 3, 4, 5, 8,
                          0, 1, 3, 4, 8, 10, 
                          0, 9, 10, 11, 14, 14)

data

         date id days_from_start
1  2020-11-29  1               0
2  2020-11-30  1               1
3  2020-12-02  1               3
4  2020-12-04  1               4
5  2020-12-05  1               5
6  2020-12-08  1               8
7  2020-11-21  2               0
8  2020-11-22  2               1
9  2020-11-24  2               3
10 2020-11-25  2               4
11 2020-11-29  2               8
12 2020-11-30  2              10
13 2021-01-20  3               0
14 2021-01-29  3               9
15 2021-01-30  3              10
16 2021-02-01  3              11
17 2021-02-04  3              14
18 2021-02-04  3              14

Any ideas?

Thank you


Solution

  • Simply group the data, work out the earliest date for each id and then calculate differences.

    data <- dplyr::arrange(data, id, date)
     data %>%
       group_by(id) %>% 
       mutate(
         start_date=min(date),
         days_from_start=as.numeric(date-start_date)
       ) %>% 
       ungroup() %>% 
       select(-start_date)
    # A tibble: 18 x 3
       date          id days_from_start
       <date>     <int>           <dbl>
     1 2020-11-29     1               0
     2 2020-11-30     1               1
     3 2020-12-02     1               3
     4 2020-12-04     1               5
     5 2020-12-05     1               6
     6 2020-12-08     1               9
     7 2020-11-21     2               0
     8 2020-11-22     2               1
     9 2020-11-24     2               3
    10 2020-11-25     2               4
    11 2020-11-29     2               8
    12 2020-11-30     2               9
    13 2021-01-20     3               0
    14 2021-01-29     3               9
    15 2021-01-30     3              10
    16 2021-02-01     3              12
    17 2021-02-04     3              15
    18 2021-02-04     3              15