Search code examples
rdplyrdate-range

R (dplyr): find all rows in row-specific range


I have a dataset where each row is identified by a hospitalization id. Each row contains information on the hospitalization id, hospital id where it took place as well as date of admission and date of discharge of the given hospitalization.

I would like to know, for each hospitalization, the id of all other hospitalizations concluded in the 30 days before the beginning of the given hospitalization in the given hospital where it took place.

Below is a simple example of 6 hospitalizations taking place in 2 hospitals.

library("tidyverse")

df <- data.frame(hospitalization_id = c(1, 2, 3, 
                                        4, 5, 6),
                 hospital_id = c("A", "A", "A", 
                                 "B", "B", "B"),
                 date_start = as.Date(c("2000-01-01", "2000-01-12", "2000-01-20",
                                        "2000-02-10", "2000-02-12", "2000-02-12")),
                 date_end = as.Date(c("2000-01-03", "2000-01-18", "2000-01-22",
                                      "2000-02-11", "2000-02-14", "2000-01-17")))

I created the dates in the 30-day interval prior to the beginning of each given hospitalization.

df_with_date_range <- df %>%
  mutate(date_range1 = date_start - 31,
         date_range2 = date_start - 1)

I am trying to write a code that adds a column with all the hospitalizations that took place in this 30-day interval range. My desired output is below:

df_final <- df_with_date_range %>%
  filter(hospitalization_id == 3) %>%
  bind_rows(df_with_date_range) %>%
  arrange(hospitalization_id) %>%
  mutate(hospitalization_id_in_range = c(NA, 1, 1, 2, NA, 4, 4))

In hospital A:

  • there was no hospitalization in the 30-day period before hospitalization 1
  • hospitalization 1 took place in the 30-day period before hospitalization 2
  • hospitalization 1 & 2 took place in the 30-day period before hospitalization 3

In hospital B:

  • there was no hospitalization in the 30-day period before hospitalization 4
  • hospitalization 4 took place in the 30-day period before hospitalization 5
  • hospitalization 4 took place in the 30-day period before hospitalization 6

Solution

  • With data.table:

    library(data.table)
    
    setDT(df)
    df[,from:=date_start-30]
    
    df[df,.(hospital_id,
            hospitalization_id = i.hospitalization_id, 
            date_start = i.date_start,
            date_end = i.date_end,
            prev_hospitalization_id = x.hospitalization_id,
            prev_date_start = x.date_start),
          on=.(hospital_id,date_start>=from,date_start<date_start)]
    
       hospital_id hospitalization_id date_start   date_end prev_hospitalization_id prev_date_start
            <char>              <num>     <Date>     <Date>                   <num>          <Date>
    1:           A                  1 2000-01-01 2000-01-03                      NA            <NA>
    2:           A                  2 2000-01-12 2000-01-18                       1      2000-01-01
    3:           A                  3 2000-01-20 2000-01-22                       1      2000-01-01
    4:           A                  3 2000-01-20 2000-01-22                       2      2000-01-12
    5:           B                  4 2000-02-10 2000-02-11                      NA            <NA>
    6:           B                  5 2000-02-12 2000-02-14                       4      2000-02-10
    7:           B                  6 2000-02-12 2000-01-17                       4      2000-02-10
    

    The x and i prefixes refer to the X[i,j,by] syntax specific to data.table.
    This means in the example above that x is the outer df and i the inner df.