I have a dataframe like the one below in R:
### Packages
library(tidyverse)
library(Epi)
library(survival)
library(lubridate)
### Create data:
End_Date <- as.Date("1968-01-01") + days(sample (c(250:365), size=500, replace =T))
Example_DF <- as.data.frame(End_Date)
Example_DF$Start_Date <- as.Date("1968-01-01")
Example_DF$Exposure <- Example_DF$End_Date - days(sample (c(1:249), size=500, replace =T))
Example_DF$ID <- seq(1,500,1)
What I want do is to for each calendar month from 1968-01 until and including 1969-05 create two new columns per calendar month that is summing up the number of days of person-time each person (ID) is providing as unexposed and exposed, respectively. These columns can for example be called 1968_01_Unexposed, 1968_01_Exposed etc.
The date of exposure is found in the column Exposure. What I want in the end is thus a dataframe with 41 columns (4 in the original dataframe plus 34 columns (2 per 17 calendar month between 1968-01 and 1969-05)). For example ID 1 is having 31 person days as unexposed for 1968-01, 0 days as exposed for 1968-01 etc until 1968-07, where ID 1 has 10 days of unexposed and 21 days as exposed.
Anyone knows how this can be done in a convenient way?
The following should get you going. In fact, you have developed part of the "algorithm" already yourself with the last para of your problem description.
Working with {tidyverse}
and tibbles
/data frames
try to think in vectors/columns before presenting the result in a more human-readable wide
-way.
I demonstrate the initial part of how you can go about it with your first 2 entries and solving the logical conditions on the number of days.
I leave it to you to apply this approach then to the exposed days and read up on {tidyr}
's pivot_wider()
to spread your results across the desired columns.
While you provide some sample data and thus a reproducible example, the sample seems not to operate on 17 months. I did not check the example for further consistency.
library(tidyverse)
library(lubridate)
# first problem - each ID needs a month entry for our time horizon ---------------
## define the time horizon
Month_Bin <- seq(from = min(Example_DF$Start_Date)
, to = max(Example_DF$End_Date)
, by = "month")
## expand your (here first 2 entries) over the time horizon
Example_DF[1:2,] %>% # with [1:2,] the df is truncated to the first 2 rows - remove for full example
expand(ID, Month_Bin)
# combine with original data set to calculate conditions -----------------------
Example_DF[1:2,] %>%
expand(ID, Month_Bin) %>%
left_join(Example_DF, by = "ID")
# with this data we can now work on the conditions and --------------------------
# determine the days
Example_DF[1:2,] %>%
expand(ID, Month_Bin) %>%
left_join(Example_DF, by = "ID") %>%
## --------------- let's define whether the Month_Bin is before Exposure
## --------------- lubridate let's you work with "floored" dates ~ first of month
mutate(
Unexposed = floor_date( Exposure, "month") > floor_date(Month_Bin, "month")
, Exposed = floor_date(Exposure, "month") < floor_date(Month_Bin, "month")) %>%
## -------------- now you can detemine the days per month based on the condition
## -------------- multiple if-else() conditions are nicely packed into case_when
mutate(
Unexposed_Days = case_when(
Unexposed & !Exposed ~ days_in_month(Month_Bin)
,!Unexposed & !Exposed ~ as.integer(difftime(Exposure, Month_Bin, "days"))
,TRUE ~ as.integer(NA) # case_when() requires type consistency for default
)
) %>%
#--------------- for presentation I force the first 20 rows (ignore this)
head(20)
This yields:
# A tibble: 20 x 8
ID Month_Bin End_Date Start_Date Exposure Unexposed Exposed Unexposed_Days
<dbl> <date> <date> <date> <date> <lgl> <lgl> <int>
1 1 1968-01-01 1968-09-21 1968-01-01 1968-02-25 TRUE FALSE 31
2 1 1968-02-01 1968-09-21 1968-01-01 1968-02-25 FALSE FALSE 24
3 1 1968-03-01 1968-09-21 1968-01-01 1968-02-25 FALSE TRUE NA
4 1 1968-04-01 1968-09-21 1968-01-01 1968-02-25 FALSE TRUE NA
5 1 1968-05-01 1968-09-21 1968-01-01 1968-02-25 FALSE TRUE NA
6 1 1968-06-01 1968-09-21 1968-01-01 1968-02-25 FALSE TRUE NA
7 1 1968-07-01 1968-09-21 1968-01-01 1968-02-25 FALSE TRUE NA
8 1 1968-08-01 1968-09-21 1968-01-01 1968-02-25 FALSE TRUE NA
9 1 1968-09-01 1968-09-21 1968-01-01 1968-02-25 FALSE TRUE NA
10 1 1968-10-01 1968-09-21 1968-01-01 1968-02-25 FALSE TRUE NA
11 1 1968-11-01 1968-09-21 1968-01-01 1968-02-25 FALSE TRUE NA
12 1 1968-12-01 1968-09-21 1968-01-01 1968-02-25 FALSE TRUE NA
13 2 1968-01-01 1968-12-11 1968-01-01 1968-06-21 TRUE FALSE 31
14 2 1968-02-01 1968-12-11 1968-01-01 1968-06-21 TRUE FALSE 29
15 2 1968-03-01 1968-12-11 1968-01-01 1968-06-21 TRUE FALSE 31
16 2 1968-04-01 1968-12-11 1968-01-01 1968-06-21 TRUE FALSE 30
17 2 1968-05-01 1968-12-11 1968-01-01 1968-06-21 TRUE FALSE 31
18 2 1968-06-01 1968-12-11 1968-01-01 1968-06-21 FALSE FALSE 20
19 2 1968-07-01 1968-12-11 1968-01-01 1968-06-21 FALSE TRUE NA
20 2 1968-08-01 1968-12-11 1968-01-01 1968-06-21 FALSE TRUE NA
You should be able to construct the required number of days for the exposed case.
Then read up on {tidyr}
and pivot_longer
to spread your long table to a wide format that you want to have.