Search code examples
rdifflag

How to create a new variable from values ​in the same column?


My goal is to create a new variable that represents the difference between two dates in the same column in longitudinal data

animal   data        new_var
1      15/03/2020      NA
1      18/03/2020      3
1      18/04/2020      30     
1      20/04/2020      2
2      13/01/2020      NA
2      18/01/2020      5
2      25/01/2020      7
2      25/03/2020      30

The new_var is the difference (in days) between the two consecutive dates for the same animal. The file was previously sorted by animal and date.

I thought of the following solution:

 animal   data           data2           new_var
    1      15/03/2020      .                 .
    1      18/03/2020    15/03/2020          3
    1      18/04/2020    18/03/2020         30     
    1      20/04/2020    18/04/2020          2
    2      13/01/2020       .               NA
    2      18/01/2020    13/01/2020          5
    2      25/01/2020    18/01/2020          7
    2      25/03/2020    25/01/2020         60

I try with diff function but I am receiving the error message when I try this:

df$data2 <- diff(df$data, lag=1) df$new_var <- df$data - df$data2

I hope I am clear in conveying my message. If not I think the small piece of example code and how I'd like to extend it should be clear enough. Looking forward to suggestions.


Solution

  • diff returns a length 1 less than the original data column length. We need to append a value at the beginning or end to correct it. Also, it may need a grouping by 'animal'

    library(dplyr)
    library(lubridate)
    df %>% 
       group_by(animal) %>%
       mutate(new_var = as.numeric(c(NA, diff(dmy(data))))) %>%
       ungroup
    

    -output

    # A tibble: 8 x 3
    #  animal data       new_var
    #   <int> <chr>        <dbl>
    #1      1 15/03/2020      NA
    #2      1 18/03/2020       3
    #3      1 18/04/2020      31
    #4      1 20/04/2020       2
    #5      2 13/01/2020      NA
    #6      2 18/01/2020       5
    #7      2 25/01/2020       7
    #8      2 25/03/2020      60
    

    data

    df <- structure(list(animal = c(1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L), data = c("15/03/2020", 
    "18/03/2020", "18/04/2020", "20/04/2020", "13/01/2020", "18/01/2020", 
    "25/01/2020", "25/03/2020")), row.names = c(NA, -8L), class = "data.frame")