Calculate person-time per calendar month using two date columns as references

I have a dataframe like the one below in R:

### Packages
library(tidyverse)
library(Epi)
library(survival)
library(lubridate)

### Create data:
End_Date <- as.Date("1968-01-01") + days(sample (c(250:365), size=500, replace =T))
Example_DF <- as.data.frame(End_Date)
Example_DF$Start_Date <- as.Date("1968-01-01")
Example_DF$Exposure <- Example_DF$End_Date - days(sample (c(1:249), size=500, replace =T))
Example_DF$ID <- seq(1,500,1)

What I want do is to for each calendar month from 1968-01 until and including 1969-05 create two new columns per calendar month that is summing up the number of days of person-time each person (ID) is providing as unexposed and exposed, respectively. These columns can for example be called 1968_01_Unexposed, 1968_01_Exposed etc.

The date of exposure is found in the column Exposure. What I want in the end is thus a dataframe with 41 columns (4 in the original dataframe plus 34 columns (2 per 17 calendar month between 1968-01 and 1969-05)). For example ID 1 is having 31 person days as unexposed for 1968-01, 0 days as exposed for 1968-01 etc until 1968-07, where ID 1 has 10 days of unexposed and 21 days as exposed.

Anyone knows how this can be done in a convenient way?

Solution

The following should get you going. In fact, you have developed part of the "algorithm" already yourself with the last para of your problem description.

Working with {tidyverse} and tibbles/data frames try to think in vectors/columns before presenting the result in a more human-readable wide-way.

I demonstrate the initial part of how you can go about it with your first 2 entries and solving the logical conditions on the number of days.

I leave it to you to apply this approach then to the exposed days and read up on {tidyr}'s pivot_wider() to spread your results across the desired columns.

While you provide some sample data and thus a reproducible example, the sample seems not to operate on 17 months. I did not check the example for further consistency.

library(tidyverse)
library(lubridate)

# first problem - each ID needs a month entry for our time horizon ---------------
## define the  time horizon
Month_Bin <- seq(from = min(Example_DF$Start_Date)
                 , to = max(Example_DF$End_Date)
                 , by = "month")

## expand your (here first 2 entries) over the time horizon
Example_DF[1:2,] %>%        # with [1:2,] the df is truncated to the first 2 rows - remove for full example
  expand(ID, Month_Bin)  

# combine with original data set to calculate conditions -----------------------

Example_DF[1:2,] %>% 
    expand(ID, Month_Bin) %>% 
    left_join(Example_DF, by = "ID") 

# with this data we can now work on the conditions and --------------------------
# determine the days
Example_DF[1:2,] %>% 
    expand(ID, Month_Bin) %>% 
    left_join(Example_DF, by = "ID") %>% 

## --------------- let's define whether the Month_Bin is before Exposure
## --------------- lubridate let's you work with "floored" dates ~ first of month 
mutate(
  Unexposed = floor_date( Exposure, "month") > floor_date(Month_Bin, "month")
, Exposed = floor_date(Exposure, "month")    < floor_date(Month_Bin, "month")) %>%

## -------------- now you can detemine the days per month based on the condition
## -------------- multiple if-else() conditions are nicely packed into case_when
 mutate(
    Unexposed_Days = case_when(
         Unexposed  & !Exposed ~ days_in_month(Month_Bin)
        ,!Unexposed & !Exposed ~ as.integer(difftime(Exposure, Month_Bin, "days"))
        ,TRUE ~ as.integer(NA)    # case_when() requires type consistency for default
        )
    ) %>% 
#--------------- for presentation I force the first 20 rows (ignore this)
head(20)

This yields:

# A tibble: 20 x 8
      ID Month_Bin  End_Date   Start_Date Exposure   Unexposed Exposed Unexposed_Days
   <dbl> <date>     <date>     <date>     <date>     <lgl>     <lgl>            <int>
 1     1 1968-01-01 1968-09-21 1968-01-01 1968-02-25 TRUE      FALSE               31
 2     1 1968-02-01 1968-09-21 1968-01-01 1968-02-25 FALSE     FALSE               24
 3     1 1968-03-01 1968-09-21 1968-01-01 1968-02-25 FALSE     TRUE                NA
 4     1 1968-04-01 1968-09-21 1968-01-01 1968-02-25 FALSE     TRUE                NA
 5     1 1968-05-01 1968-09-21 1968-01-01 1968-02-25 FALSE     TRUE                NA
 6     1 1968-06-01 1968-09-21 1968-01-01 1968-02-25 FALSE     TRUE                NA
 7     1 1968-07-01 1968-09-21 1968-01-01 1968-02-25 FALSE     TRUE                NA
 8     1 1968-08-01 1968-09-21 1968-01-01 1968-02-25 FALSE     TRUE                NA
 9     1 1968-09-01 1968-09-21 1968-01-01 1968-02-25 FALSE     TRUE                NA
10     1 1968-10-01 1968-09-21 1968-01-01 1968-02-25 FALSE     TRUE                NA
11     1 1968-11-01 1968-09-21 1968-01-01 1968-02-25 FALSE     TRUE                NA
12     1 1968-12-01 1968-09-21 1968-01-01 1968-02-25 FALSE     TRUE                NA
13     2 1968-01-01 1968-12-11 1968-01-01 1968-06-21 TRUE      FALSE               31
14     2 1968-02-01 1968-12-11 1968-01-01 1968-06-21 TRUE      FALSE               29
15     2 1968-03-01 1968-12-11 1968-01-01 1968-06-21 TRUE      FALSE               31
16     2 1968-04-01 1968-12-11 1968-01-01 1968-06-21 TRUE      FALSE               30
17     2 1968-05-01 1968-12-11 1968-01-01 1968-06-21 TRUE      FALSE               31
18     2 1968-06-01 1968-12-11 1968-01-01 1968-06-21 FALSE     FALSE               20
19     2 1968-07-01 1968-12-11 1968-01-01 1968-06-21 FALSE     TRUE                NA
20     2 1968-08-01 1968-12-11 1968-01-01 1968-06-21 FALSE     TRUE                NA

You should be able to construct the required number of days for the exposed case.

Then read up on {tidyr} and pivot_longer to spread your long table to a wide format that you want to have.