Search code examples
rdateexposure

Calculate person-time per calendar month using two date columns as references


I have a dataframe like the one below in R:

### Packages
library(tidyverse)
library(Epi)
library(survival)
library(lubridate)

### Create data:
End_Date <- as.Date("1968-01-01") + days(sample (c(250:365), size=500, replace =T))
Example_DF <- as.data.frame(End_Date)
Example_DF$Start_Date <- as.Date("1968-01-01")
Example_DF$Exposure <- Example_DF$End_Date - days(sample (c(1:249), size=500, replace =T))
Example_DF$ID <- seq(1,500,1)

What I want do is to for each calendar month from 1968-01 until and including 1969-05 create two new columns per calendar month that is summing up the number of days of person-time each person (ID) is providing as unexposed and exposed, respectively. These columns can for example be called 1968_01_Unexposed, 1968_01_Exposed etc.

The date of exposure is found in the column Exposure. What I want in the end is thus a dataframe with 41 columns (4 in the original dataframe plus 34 columns (2 per 17 calendar month between 1968-01 and 1969-05)). For example ID 1 is having 31 person days as unexposed for 1968-01, 0 days as exposed for 1968-01 etc until 1968-07, where ID 1 has 10 days of unexposed and 21 days as exposed.

Anyone knows how this can be done in a convenient way?


Solution

  • The following should get you going. In fact, you have developed part of the "algorithm" already yourself with the last para of your problem description.

    Working with {tidyverse} and tibbles/data frames try to think in vectors/columns before presenting the result in a more human-readable wide-way.

    I demonstrate the initial part of how you can go about it with your first 2 entries and solving the logical conditions on the number of days.

    I leave it to you to apply this approach then to the exposed days and read up on {tidyr}'s pivot_wider() to spread your results across the desired columns.

    While you provide some sample data and thus a reproducible example, the sample seems not to operate on 17 months. I did not check the example for further consistency.

    library(tidyverse)
    library(lubridate)
    
    # first problem - each ID needs a month entry for our time horizon ---------------
    ## define the  time horizon
    Month_Bin <- seq(from = min(Example_DF$Start_Date)
                     , to = max(Example_DF$End_Date)
                     , by = "month")
    
    ## expand your (here first 2 entries) over the time horizon
    Example_DF[1:2,] %>%        # with [1:2,] the df is truncated to the first 2 rows - remove for full example
      expand(ID, Month_Bin)  
    
    # combine with original data set to calculate conditions -----------------------
    
    Example_DF[1:2,] %>% 
        expand(ID, Month_Bin) %>% 
        left_join(Example_DF, by = "ID") 
    
    # with this data we can now work on the conditions and --------------------------
    # determine the days
    Example_DF[1:2,] %>% 
        expand(ID, Month_Bin) %>% 
        left_join(Example_DF, by = "ID") %>% 
    
    ## --------------- let's define whether the Month_Bin is before Exposure
    ## --------------- lubridate let's you work with "floored" dates ~ first of month 
    mutate(
      Unexposed = floor_date( Exposure, "month") > floor_date(Month_Bin, "month")
    , Exposed = floor_date(Exposure, "month")    < floor_date(Month_Bin, "month")) %>%
    
    ## -------------- now you can detemine the days per month based on the condition
    ## -------------- multiple if-else() conditions are nicely packed into case_when
     mutate(
        Unexposed_Days = case_when(
             Unexposed  & !Exposed ~ days_in_month(Month_Bin)
            ,!Unexposed & !Exposed ~ as.integer(difftime(Exposure, Month_Bin, "days"))
            ,TRUE ~ as.integer(NA)    # case_when() requires type consistency for default
            )
        ) %>% 
    #--------------- for presentation I force the first 20 rows (ignore this)
    head(20)
    

    This yields:

    # A tibble: 20 x 8
          ID Month_Bin  End_Date   Start_Date Exposure   Unexposed Exposed Unexposed_Days
       <dbl> <date>     <date>     <date>     <date>     <lgl>     <lgl>            <int>
     1     1 1968-01-01 1968-09-21 1968-01-01 1968-02-25 TRUE      FALSE               31
     2     1 1968-02-01 1968-09-21 1968-01-01 1968-02-25 FALSE     FALSE               24
     3     1 1968-03-01 1968-09-21 1968-01-01 1968-02-25 FALSE     TRUE                NA
     4     1 1968-04-01 1968-09-21 1968-01-01 1968-02-25 FALSE     TRUE                NA
     5     1 1968-05-01 1968-09-21 1968-01-01 1968-02-25 FALSE     TRUE                NA
     6     1 1968-06-01 1968-09-21 1968-01-01 1968-02-25 FALSE     TRUE                NA
     7     1 1968-07-01 1968-09-21 1968-01-01 1968-02-25 FALSE     TRUE                NA
     8     1 1968-08-01 1968-09-21 1968-01-01 1968-02-25 FALSE     TRUE                NA
     9     1 1968-09-01 1968-09-21 1968-01-01 1968-02-25 FALSE     TRUE                NA
    10     1 1968-10-01 1968-09-21 1968-01-01 1968-02-25 FALSE     TRUE                NA
    11     1 1968-11-01 1968-09-21 1968-01-01 1968-02-25 FALSE     TRUE                NA
    12     1 1968-12-01 1968-09-21 1968-01-01 1968-02-25 FALSE     TRUE                NA
    13     2 1968-01-01 1968-12-11 1968-01-01 1968-06-21 TRUE      FALSE               31
    14     2 1968-02-01 1968-12-11 1968-01-01 1968-06-21 TRUE      FALSE               29
    15     2 1968-03-01 1968-12-11 1968-01-01 1968-06-21 TRUE      FALSE               31
    16     2 1968-04-01 1968-12-11 1968-01-01 1968-06-21 TRUE      FALSE               30
    17     2 1968-05-01 1968-12-11 1968-01-01 1968-06-21 TRUE      FALSE               31
    18     2 1968-06-01 1968-12-11 1968-01-01 1968-06-21 FALSE     FALSE               20
    19     2 1968-07-01 1968-12-11 1968-01-01 1968-06-21 FALSE     TRUE                NA
    20     2 1968-08-01 1968-12-11 1968-01-01 1968-06-21 FALSE     TRUE                NA
    

    You should be able to construct the required number of days for the exposed case.

    Then read up on {tidyr} and pivot_longer to spread your long table to a wide format that you want to have.