Search code examples
rdplyrdata.tableplyr

Increment indicator column in R


I have below data, and I am trying to create a indicator column to keep a track of increment in the value by id.

df <- data.frame(id=c(1,1,1,1,1,1,1,2,2,2,3,3,4),
         rank=c(1,2,3,1,1,1,2,2,3,1,1,2,3),dates=c('2019-06-15','2019-07-15','2019-08-15','2019-09-15','2019-10-15','2019-11-15','2019-12-15',
                 '2019-10-15','2019-11-15','2019-12-15',
                 '2019-11-15','2019-12-15','2019-12-15'),new_col=c(0,1,1,1,1,1,1,0,1,1,0,1,0)
         )

The order of the date matters here.

setDT(df)[order(id,dates),]

Desired Output

   id  rank   dates     new_col
1   1    1   2019-06-15       0
2   1    2   2019-07-15       1
3   1    3   2019-08-15       1
4   1    1   2019-09-15       1
5   1    1   2019-10-15       1
6   1    1   2019-11-15       1
7   1    2   2019-12-15       1
8   2    2   2019-10-15       0
9   2    3   2019-11-15       1
10  2    1   2019-12-15       1
11  3    1   2019-11-15       0
12  3    2   2019-12-15       1
13  4    3   2019-12-15       0

Updated input data-

df <- data.frame(id=c(1,1,1,1,1,1,1,2,2,2,3,3,4,5,5,5),
             rank=c(1,2,3,1,1,1,2,2,3,1,1,2,3,1,1,1),dates=c('2019-06-15','2019-07-10','2019-08-15','2019-09-15','2019-10-15','2019-11-15','2019-12-15',
                     '2019-10-15','2019-11-15','2019-12-15',
                     '2019-11-15','2019-12-15','2019-12-15','2019-10-15','2019-11-15','2019-12-15'))

Updated Output-

      id   rank  dates       new

 1     1     1 2019-06-15     0
 2     1     2 2019-07-10     1
 3     1     3 2019-08-15     1
 4     1     1 2019-09-15     1
 5     1     1 2019-10-15     1
 6     1     1 2019-11-15     1
 7     1     2 2019-12-15     1
 8     2     2 2019-10-15     0
 9     2     3 2019-11-15     1
10     2     1 2019-12-15     1
11     3     1 2019-11-15     0
12     3     2 2019-12-15     1
13     4     3 2019-12-15     0
14     5     1 2019-10-15     0
15     5     1 2019-11-15     0
16     5     1 2019-12-15     0

Solution

  • We can use diff on the 'dates' after grouping by 'id'

    library(dplyr)
    df %>% 
       group_by(id) %>%       
       mutate(new = c(0, diff(as.Date(dates)) > 0))
    # A tibble: 13 x 5
    # Groups:   id [4]
    #      id  rank dates      new_col   new
    #   <dbl> <dbl> <fct>        <dbl> <dbl>
    # 1     1     1 2019-06-15       0     0
    # 2     1     2 2019-07-15       1     1
    # 3     1     3 2019-08-15       1     1
    # 4     1     1 2019-09-15       1     1
    # 5     1     1 2019-10-15       1     1
    # 6     1     1 2019-11-15       1     1
    # 7     1     2 2019-12-15       1     1
    # 8     2     2 2019-10-15       0     0
    # 9     2     3 2019-11-15       1     1
    #10     2     1 2019-12-15       1     1
    #11     3     1 2019-11-15       0     0
    #12     3     2 2019-12-15       1     1
    #13     4     3 2019-12-15       0     0
    

    Update

    df %>% 
        group_by(id) %>% 
        mutate(new = +(c(FALSE, diff(as.Date(dates)) > 0) & 
           (any(rank  != lag(rank, default = first(rank))))))
    # A tibble: 16 x 4
    # Groups:   id [5]
    #      id  rank dates        new
    #   <dbl> <dbl> <fct>      <int>
    # 1     1     1 2019-06-15     0
    # 2     1     2 2019-07-10     1
    # 3     1     3 2019-08-15     1
    # 4     1     1 2019-09-15     1
    # 5     1     1 2019-10-15     1
    # 6     1     1 2019-11-15     1
    # 7     1     2 2019-12-15     1
    # 8     2     2 2019-10-15     0
    # 9     2     3 2019-11-15     1
    #10     2     1 2019-12-15     1
    #11     3     1 2019-11-15     0
    #12     3     2 2019-12-15     1
    #13     4     3 2019-12-15     0
    #14     5     1 2019-10-15     0
    #15     5     1 2019-11-15     0
    #16     5     1 2019-12-15     0