Search code examples
rdplyr

Remove extra measurements within specific brackets per ID in R


I have a long dataset in which each participant should have 9 measurements, but they all have more than 9. These measurements need to be within specific brackets of the Weeks variable.

  • Measurement 1 should be < 12 weeks (ideal 10)
  • Measurement 2 should be >=12 and <25 weeks (ideal 16)
  • Measurement 3 should be >=25 and <28 weeks (ideal 25)
  • Measurement 4 should be >=28 and <31 weeks (ideal 28)
  • Measurement 5 should be >=31 and <34 weeks (ideal 31)
  • Measurement 6 should be >=34 and <36 weeks (ideal 34)
  • Measurement 7 should be >=36 and <38 weeks (ideal 36)
  • Measurement 8 should be >=38 and <40 weeks (ideal 38)
  • Measurement 9 should be >=40 weeks (ideal 40)

If there is more than one measurement in the bracket per participant, I would like the measurement closest to the ideal number of weeks (indicated above in parentheses) to remain, and the rest of the rows to be removed.

This an example of my data:

ID <- c(1,1,1,1,1,1,1,1,1,1,
    2,2,2,2,2,2,2,2,2,2,2,2,2)
Weeks <- c(11,16,25,28,31,34,35,36,38,40,
       6,12,13,26,29,31,32,33,34,36,38,40,41)
Measurement <- c(120,160,170,200,120,40,89,70,63,44,
             90,92,93,94,95,96,97,100,100, 120,
             110, 100, 130)
df1 <- data.frame(ID, Weeks, Measurement)

DF1

This is what I would like it to look like, with rows 7, 12, 17, 18, and 23 removed.

DF2

Thank you for any help!


Solution

  • Since you have tagged this with dplyr, here is a dplyr solution. It is based on sorting, so it will only ever retain one row per participant and week slot.

    If it is possible for two measurements to happen in the same week, you may want to consider a way to resolve that situation, for example by averaging the measurements from the same week before discarding observations.

    library(dplyr)
    
    # Data
    df1 <- data.frame(
      ID = rep(c(1, 2), c(10L, 13L)),
      Weeks = c(11, 16, 25, 28, 31, 34, 35, 36, 38, 40, 6, 12, 13, 26, 29, 31, 32, 33, 34, 36, 38, 40, 41),
      Measurement = c(
        120, 160, 170, 200, 120, 40, 89, 70, 63, 44, 90, 92, 93, 94, 95, 96, 97, 100,
        100, 120, 110, 100, 130
      )
    )
    
    # Solution
    
    ideal = c(10, 16, 25, 28, 31, 34, 36, 38, 40)
    
    df1 |>
      mutate(slot = cut(Weeks,
                        breaks = c(0, 12, 25, 28, 31, 34, 36, 38, 40, Inf),
                        right = F
                        ),
             diff = abs(Weeks - sapply(as.integer(slot), \(s) ideal[s]))
             ) |>
      group_by(ID, slot) |>
      arrange(ID, slot, diff) |> 
      slice(1) |> 
      ungroup()
    #> # A tibble: 18 × 5
    #>       ID Weeks Measurement slot      diff
    #>    <dbl> <dbl>       <dbl> <fct>    <dbl>
    #>  1     1    11         120 [0,12)       1
    #>  2     1    16         160 [12,25)      0
    #>  3     1    25         170 [25,28)      0
    #>  4     1    28         200 [28,31)      0
    #>  5     1    31         120 [31,34)      0
    #>  6     1    34          40 [34,36)      0
    #>  7     1    36          70 [36,38)      0
    #>  8     1    38          63 [38,40)      0
    #>  9     1    40          44 [40,Inf)     0
    #> 10     2     6          90 [0,12)       4
    #> 11     2    13          93 [12,25)      3
    #> 12     2    26          94 [25,28)      1
    #> 13     2    29          95 [28,31)      1
    #> 14     2    31          96 [31,34)      0
    #> 15     2    34         100 [34,36)      0
    #> 16     2    36         120 [36,38)      0
    #> 17     2    38         110 [38,40)      0
    #> 18     2    40         100 [40,Inf)     0
    

    Created on 2024-07-14 with reprex v2.1.1