Search code examples
rgroup-bydiffdays

R: take one record per 30 days within each group


I have a dataset with over 1000 unique IDs and for each ID about 15 Surgery Codes done on different Dates(recorded as Days Diff)

I want to take only 1 record per 30 days within the group of each surgery code for each ID.

Adding a demo data here:

   ID Age Diag.Date Surgery.Code Days.diff
1   1  67  4/8/2011         A364       421
2   1  67  4/8/2011         A364      1197
3   1  67  4/8/2011         A364      2207
4   1  67  4/8/2011         A364      2226
5   1  67  4/8/2011         A364      2247
6   1  67  4/8/2011         A364      2254
7   1  67  4/8/2011         A364      2331
8   1  67  4/8/2011         A364      2367
9   1  67  4/8/2011         A364      2905
10  1  67  4/8/2011         A364      2918
11  1  67  4/8/2011         D365      2200
12  1  67  4/8/2011         D441       308
13  1  67  4/8/2011         D443       218
14  1  67  4/8/2011         A446       308
15  2  56  6/4/2018         A453      2260
16  2  56  6/4/2018         D453       645
17  2  56  6/4/2018         D453      3095
18  2  56  6/4/2018         B453       645

Diff of 2226-2207 days is 19 days so row4 will delete, again diff of 2247-2207 days is 40 days so row5 will get recorded. Again diff of 2254-2247 days is 7 days so row6 will get deleted. Similarly, row10 will get deleted.

Any help is appreciated!


Solution

    1. Use dplyr::group_by(ID, Surgery.Code) to filter within individuals and surgeries;
    2. Within each group, use Days.diff - dplyr::lag(Days.diff) <= 30 to test for adjacent rows within 30 days;
    3. Because the results of (2) may change when rows are removed, you'll want to iterate by removing one row at a time per group, then re-testing. You can use while to iterate until no more cases are detected.
    library(dplyr)
    
    filtered <- surgeries %>% 
      group_by(ID, Surgery.Code) %>% 
      mutate(within30 = if_else(
        Days.diff - lag(Days.diff) <= 30, 
        row_number(), 
        NA_integer_
      ))
    
    while (any(!is.na(filtered$within30))) {
      filtered <- filtered %>% 
        mutate(within30 = if_else(
          Days.diff - lag(Days.diff) <= 30, 
          row_number(), 
          NA_integer_
        )) %>% 
        filter(is.na(within30) | within30 != min(within30, na.rm = TRUE))
    }
    
    filtered %>% 
      select(!within30) %>% 
      ungroup()
    
    #> # A tibble: 15 x 5
    #>       ID   Age Diag.Date Surgery.Code Days.diff
    #>    <int> <int> <chr>     <chr>            <int>
    #>  1     1    67 4/8/2011  A364               421
    #>  2     1    67 4/8/2011  A364              1197
    #>  3     1    67 4/8/2011  A364              2207
    #>  4     1    67 4/8/2011  A364              2247
    #>  5     1    67 4/8/2011  A364              2331
    #>  6     1    67 4/8/2011  A364              2367
    #>  7     1    67 4/8/2011  A364              2905
    #>  8     1    67 4/8/2011  D365              2200
    #>  9     1    67 4/8/2011  D441               308
    #> 10     1    67 4/8/2011  D443               218
    #> 11     1    67 4/8/2011  A446               308
    #> 12     2    56 6/4/2018  A453              2260
    #> 13     2    56 6/4/2018  D453               645
    #> 14     2    56 6/4/2018  D453              3095
    #> 15     2    56 6/4/2018  B453               645
    

    Created on 2022-03-01 by the reprex package (v2.0.1)