Search code examples
rloopsgroup

loop to select the date closest to another date, for multiple variables


For each MRN ID, I would like to select the date (in OBSERVATION_DATE), in association with various columns (i.e. hdl, ldl, and vldl) that is closest to the dates listed in multiple other columns (i.e. Base, SixMonths, TwelveMonths, and TwentyFourMonths). So for example, for the rows that report HDL data, I would like to assess the dates found in OBSEVATION_DATE to find the date closest to Base, SixMonths, etc. There should be 4 new columns created for that example (one for each of the 4 time points listed). Additionally, for that example, I want another set of 4 columns created that have the value associated with the date that was identified in the other 4 new columns. So in total for my sample data,there should be 24 new columns (See below).

The algorithm should look for only the observations that have data in a given column as potential selections for dates that are closest to Base, SixMonths, TwelveMonths, and TwentyFourMonths. For example, if a row does not have data for HDL, it should not be considered as part of the selection pool to choose a date from. Not all dates given for a person will have data in all the columns I am concerned about. therefore, it would not be useful to rely solely on an algorithm that chooses a date for a person regardless of whether there is data for that variable at that date.

HDL OBSERVATION_DATE date closest to Base,
HDL OBSERVATION_DATE date closest to SixMonths,
HDL OBSERVATION_DATE date closest to TwelveMonths,
HDL OBSERVATION_DATE date closest to TwentyFourMonths,
HDL value associated with OBSERVATION_DATE date closest to Base,
HDL value associated with OBSERVATION_DATE date closest to SixMonths,
HDL value associated with OBSERVATION_DATE date closest to TwelveMonths,
HDL value associated with OBSERVATION_DATE date closest to TwentyFourMonths,

LDL OBSERVATION_DATE date closest to Base,
LDL OBSERVATION_DATE date closest to SixMonths,
LDL OBSERVATION_DATE date closest to TwelveMonths,
LDL OBSERVATION_DATE date closest to TwentyFourMonths,
LDL value associated with OBSERVATION_DATE date closest to Base,
LDL value associated with OBSERVATION_DATE date closest to SixMonths,
LDL value associated with OBSERVATION_DATE date closest to TwelveMonths,
LDL value associated with OBSERVATION_DATE date closest to TwentyFourMonths,

VLDL OBSERVATION_DATE date closest to Base,
VLDL OBSERVATION_DATE date closest to SixMonths,
VLDL OBSERVATION_DATE date closest to TwelveMonths,
VLDL OBSERVATION_DATE date closest to TwentyFourMonths,
VLDL value associated with OBSERVATION_DATE date closest to Base,
VLDL value associated with OBSERVATION_DATE date closest to SixMonths,
VLDL value associated with OBSERVATION_DATE date closest to TwelveMonths,
VLDL value associated with OBSERVATION_DATE date closest to TwentyFourMonths,

Here is my data:

structure(list(MRN = c(15842, 15842, 15842, 19463, 19463, 19463, 
19463, 19463, 19463, 19463, 19463, 19463, 19463, 19463, 19463, 
19463, 19463, 19463, 19463, 19463, 19463, 34025, 34025, 34025, 
34025, 34025, 34025, 37465, 37465, 37465, 68874, 68874, 68874, 
76133, 76133, 76133, 76133, 76133, 76133, 76133, 76133, 76133, 
76133, 76133, 76133, 76133, 76133, 76133, 76133, 76133), OBSERVATION_DATE = structure(c(18289, 
18289, 18289, 16073, 16073, 16073, 16434, 16434, 16434, 16536, 
16536, 16536, 16821, 16821, 16821, 17196, 17196, 17196, 17604, 
17604, 17604, 19114, 19114, 19114, 19338, 19338, 19338, 19060, 
19060, 19060, 19730, 19730, 19730, 17326, 17326, 17326, 17331, 
17331, 17331, 17333, 17333, 17333, 17336, 17336, 17336, 17339, 
17339, 17339, 17347, 17347), class = "Date"), HDL = c(NA, 47, 
NA, 40, NA, NA, NA, 43, NA, 38, NA, NA, NA, 41, NA, NA, 48, NA, 
NA, 45, NA, NA, 44, NA, NA, 42, NA, NA, NA, 56, 16, NA, NA, NA, 
34, NA, 34, NA, NA, 31, NA, NA, 33, NA, NA, NA, 32, NA, NA, NA
), LDL = c(NA, NA, 83, NA, 92, NA, 107, NA, NA, NA, NA, 112, 
93, NA, NA, 96, NA, NA, 109, NA, NA, NA, NA, 76, 56, NA, NA, 
141, NA, NA, NA, NA, 49, 55, NA, NA, NA, NA, 57, NA, 53, NA, 
NA, NA, 59, 55, NA, NA, NA, 55), VLDL = c(14, NA, NA, NA, NA, 
46, NA, NA, 30, NA, 30, NA, NA, NA, 28, NA, NA, 20, NA, NA, 28, 
17, NA, NA, NA, NA, 21, NA, 35, NA, NA, 15, NA, NA, NA, 24, NA, 
24, NA, NA, NA, 20, NA, 23, NA, NA, NA, 26, 22, NA), Base = structure(c(17647, 
17647, 17647, 17032, 17032, 17032, 17032, 17032, 17032, 17032, 
17032, 17032, 17032, 17032, 17032, 17032, 17032, 17032, 17032, 
17032, 17032, 18577, 18577, 18577, 18577, 18577, 18577, 18894, 
18894, 18894, 19431, 19431, 19431, 16751, 16751, 16751, 16751, 
16751, 16751, 16751, 16751, 16751, 16751, 16751, 16751, 16751, 
16751, 16751, 16751, 16751), class = "Date"), SixMonths = structure(c(17830, 
17830, 17830, 17215, 17215, 17215, 17215, 17215, 17215, 17215, 
17215, 17215, 17215, 17215, 17215, 17215, 17215, 17215, 17215, 
17215, 17215, 18760, 18760, 18760, 18760, 18760, 18760, 19077, 
19077, 19077, 19614, 19614, 19614, 16934, 16934, 16934, 16934, 
16934, 16934, 16934, 16934, 16934, 16934, 16934, 16934, 16934, 
16934, 16934, 16934, 16934), class = "Date"), TwelveMonths = structure(c(18012, 
18012, 18012, 17397, 17397, 17397, 17397, 17397, 17397, 17397, 
17397, 17397, 17397, 17397, 17397, 17397, 17397, 17397, 17397, 
17397, 17397, 18942, 18942, 18942, 18942, 18942, 18942, 19259, 
19259, 19259, 19796, 19796, 19796, 17116, 17116, 17116, 17116, 
17116, 17116, 17116, 17116, 17116, 17116, 17116, 17116, 17116, 
17116, 17116, 17116, 17116), class = "Date"), TwentyFourMonths = structure(c(18377, 
18377, 18377, 17762, 17762, 17762, 17762, 17762, 17762, 17762, 
17762, 17762, 17762, 17762, 17762, 17762, 17762, 17762, 17762, 
17762, 17762, 19307, 19307, 19307, 19307, 19307, 19307, 19624, 
19624, 19624, 20161, 20161, 20161, 17481, 17481, 17481, 17481, 
17481, 17481, 17481, 17481, 17481, 17481, 17481, 17481, 17481, 
17481, 17481, 17481, 17481), class = "Date")), row.names = c(NA, 
-50L), class = "data.frame")

Solution

  • I'm not sure I follow all the requirements but I think this essentially gets what you want.

    I make one table with all the valid non-NA measures. I make another with all the milestone dates for each patient.

    Then I full_join these, arrange in order of date difference, and keep the first observation for each MRN/milestone/measure.

    (For large data with, say, thousands of observations per patient, this full join could produce output that is too large to handle. As of dplyr 1.1.4 in mid-2023, we can't directly get the "closest" match with a join, but we could join with the closest date that is <= and separately join with the closest date that is >=, and pick between those as a final step.)

    library(tidyverse)
    df_measurements <- df |>
      select(MRN:VLDL) |>
      pivot_longer(HDL:VLDL, names_to = "measure", values_drop_na = TRUE)
    
    df_patient_timeline <- df |>
      select(-(OBSERVATION_DATE:VLDL)) |>
      distinct() |>
      pivot_longer(-MRN, names_to = "milestone", values_to = "DATE")
    
    df_patient_timeline |>
      full_join(df_measurements, join_by(MRN)) |>
      arrange(MRN, milestone, measure, abs(DATE - OBSERVATION_DATE)) |>
      slice(1, .by = c(MRN, milestone, measure))
    

    Result:

         MRN milestone        DATE       OBSERVATION_DATE measure value
       <dbl> <chr>            <date>     <date>           <chr>   <dbl>
     1 15842 Base             2018-04-26 2020-01-28       HDL        47
     2 15842 Base             2018-04-26 2020-01-28       LDL        83
     3 15842 Base             2018-04-26 2020-01-28       VLDL       14
     4 15842 SixMonths        2018-10-26 2020-01-28       HDL        47
     5 15842 SixMonths        2018-10-26 2020-01-28       LDL        83
     6 15842 SixMonths        2018-10-26 2020-01-28       VLDL       14
     7 15842 TwelveMonths     2019-04-26 2020-01-28       HDL        47
     8 15842 TwelveMonths     2019-04-26 2020-01-28       LDL        83
     9 15842 TwelveMonths     2019-04-26 2020-01-28       VLDL       14
    10 15842 TwentyFourMonths 2020-04-25 2020-01-28       HDL        47
    # ℹ 62 more rows
    

    I presume your intended output is a wider version of this data. If we add the code below...

    ... |> 
    pivot_wider(names_from = c(measure, milestone), 
                values_from = c(OBSERVATION_DATE, value))
    

    we get this super-wide table with (I think) the columns you sought:

    # A tibble: 24 × 26
         MRN DATE       OBSERVATION_DATE_HDL_Base OBSERVATION_DATE_LDL_Base OBSERVATION_DATE_VLDL_Base OBSERVATION_DATE_HDL_SixMonths OBSERVATION_DATE_LDL_S…¹ OBSERVATION_DATE_VLD…² OBSERVATION_DATE_HDL…³ OBSERVATION_DATE_LDL…⁴ OBSERVATION_DATE_VLD…⁵
       <dbl> <date>     <date>                    <date>                    <date>                     <date>                         <date>                   <date>                 <date>                 <date>                 <date>                
     1 15842 2018-04-26 2020-01-28                2020-01-28                2020-01-28                 NA                             NA                       NA                     NA                     NA                     NA                    
     2 15842 2018-10-26 NA                        NA                        NA                         2020-01-28                     2020-01-28               2020-01-28             NA                     NA                     NA                    
     3 15842 2019-04-26 NA                        NA                        NA                         NA                             NA                       NA                     2020-01-28             2020-01-28             2020-01-28            
     4 15842 2020-04-25 NA                        NA                        NA                         NA                             NA                       NA                     NA                     NA                     NA                    
     5 19463 2016-08-19 2017-01-30                2017-01-30                2017-01-30                 NA                             NA                       NA                     NA                     NA                     NA                    
     6 19463 2017-02-18 NA                        NA                        NA                         2017-01-30                     2017-01-30               2017-01-30             NA                     NA                     NA                    
     7 19463 2017-08-19 NA                        NA                        NA                         NA                             NA                       NA                     2017-01-30             2017-01-30             2017-01-30            
     8 19463 2018-08-19 NA                        NA                        NA                         NA                             NA                       NA                     NA                     NA                     NA                    
     9 34025 2020-11-11 2022-05-02                2022-05-02                2022-05-02                 NA                             NA                       NA                     NA                     NA                     NA                    
    10 34025 2021-05-13 NA                        NA                        NA                         2022-05-02                     2022-05-02               2022-05-02             NA                     NA                     NA                    
    # ℹ 14 more rows
    # ℹ abbreviated names: ¹​OBSERVATION_DATE_LDL_SixMonths, ²​OBSERVATION_DATE_VLDL_SixMonths, ³​OBSERVATION_DATE_HDL_TwelveMonths, ⁴​OBSERVATION_DATE_LDL_TwelveMonths, ⁵​OBSERVATION_DATE_VLDL_TwelveMonths
    # ℹ 15 more variables: OBSERVATION_DATE_HDL_TwentyFourMonths <date>, OBSERVATION_DATE_LDL_TwentyFourMonths <date>, OBSERVATION_DATE_VLDL_TwentyFourMonths <date>, value_HDL_Base <dbl>, value_LDL_Base <dbl>, value_VLDL_Base <dbl>,
    #   value_HDL_SixMonths <dbl>, value_LDL_SixMonths <dbl>, value_VLDL_SixMonths <dbl>, value_HDL_TwelveMonths <dbl>, value_LDL_TwelveMonths <dbl>, value_VLDL_TwelveMonths <dbl>, value_HDL_TwentyFourMonths <dbl>, value_LDL_TwentyFourMonths <dbl>,
    #   value_VLDL_TwentyFourMonths <dbl>
    # ℹ Use `print(n = ...)` to see more rows