I have a long table format and try to construct a wide format that contains additional variables based on the rows.
My data has rows based on patients, their treatments, and their visits. Each row represents a single visit as the lowest level in the data structure hierarchy. Below I have tried to illustrate the structure, although the real data has many more variables that I wish to keep in the output. Here I just illustrated additional variables as var_x and var_y.
I wish to get a table that:
As you can see in the example, the patient may have several visits before a new treatment starts. In the 'want' table I have tried to illustrate that patient 1 has bio_drug_stop_date based on existing info from original table and patient 2 and 3 have bio_drug_stop_date derived from first case of bio_drug_series==2.
have:
df1 <- data.frame(
patient_id = c(1, 1, 1, 2, 2, 2, 3, 3, 3, 3),
bio_drug_series_number = c(1,1,2,1,2,3,1,1,2,3),
bio_drug_start_date = c("12-01-2015", "12-01-2015", "20-05-2008", "01-09-2020", "03-09-2020", "04-02-2021", "05-07-2004", "05-07-2004", "17-09-2011", "24-05-2019"),
bio_drug_stop_date = c("18-05-2008", "18-05-2008", NA, NA, NA, NA, NA, NA, NA, NA),
var_x = c("c", "a", "f", "g", "n", "o", "p", "q", "t", "u"),
var_y = c("d", "b", "e", "m", "l", "k", "h", "r", "s", "v")
)
want:
df222 <- data.frame(
patient_id = c(1,2,3),
bio_drug_series_number = c(1,1,1),
bio_drug_start_date = c("12-01-2015", "01-09-2020", "05-07-2004"),
bio_drug_stop_date = c("18-05-2008", "02-09-2020", "16-09-2011"),
var_x = c("c", "g", "p"),
var_y = c("d", "m", "h")
)
I would have thought I could find a similar case on StackOverflow, but wasn't able to.
It's a fairly specific solution you need so that's why you likely didn't find an existing answer. However, the principles behind the following answer can be found elsewhere on SO. It involves a stepwise approach with a temporary column, and the process could be refined further, but it works.
One thing to note, your date columns are strings, so I converted them to Date format. This allows one day to be subtracted from the date where necessary. Based on your sample data, this solution assumes that if a date is present in the bio_drug_stop_date for a given patient, the first instance of a date is the date that you want.
library(dplyr)
# Sample data
df1 <- data.frame(
patient_id = c(1, 1, 1, 2, 2, 2, 3, 3, 3, 3),
bio_drug_series_number = c(1,1,2,1,2,3,1,1,2,3),
bio_drug_start_date = c("12-01-2015", "12-01-2015", "20-05-2008", "01-09-2020", "03-09-2020", "04-02-2021", "05-07-2004", "05-07-2004", "17-09-2011", "24-05-2019"),
bio_drug_stop_date = c("18-05-2008", "18-05-2008", NA, NA, NA, NA, NA, NA, NA, NA),
var_x = c("c", "a", "f", "g", "n", "o", "p", "q", "t", "u"),
var_y = c("d", "b", "e", "m", "l", "k", "h", "r", "s", "v")
)
# Convert date columns to as.Date()
df1$bio_drug_start_date <- as.Date(df1$bio_drug_start_date, format = "%d-%m-%Y")
df1$bio_drug_stop_date <- as.Date(df1$bio_drug_stop_date, format = "%d-%m-%Y")
# Assign missing date values to bio_drug_stop_date, subtract one day if necessary,
# and return first row of each group
df2 <- df1 |>
group_by(patient_id) |>
mutate(tmp = if_else(any(!is.na(bio_drug_stop_date)),
first(na.omit(bio_drug_stop_date)),
first(bio_drug_start_date[bio_drug_series_number == 2]) - 1),
bio_drug_stop_date = if_else(is.na(bio_drug_stop_date),
tmp, bio_drug_stop_date)) |>
select(-tmp) |>
slice(1) |>
ungroup()
df2
# # A tibble: 3 × 6
# patient_id bio_drug_series_number bio_drug_start_date bio_drug_stop_date var_x var_y
# <dbl> <dbl> <date> <date> <chr> <chr>
# 1 1 1 2015-01-12 2008-05-18 c d
# 2 2 1 2020-09-01 2020-09-02 g m
# 3 3 1 2004-07-05 2011-09-16 p h