Search code examples
rlong-integer

How to change long format into wide and add new derived variable based on the original rows


I have a long table format and try to construct a wide format that contains additional variables based on the rows.

My data has rows based on patients, their treatments, and their visits. Each row represents a single visit as the lowest level in the data structure hierarchy. Below I have tried to illustrate the structure, although the real data has many more variables that I wish to keep in the output. Here I just illustrated additional variables as var_x and var_y.

I wish to get a table that:

  1. keeps all variables from the original dataset
  2. only include the first visit row per patient, and
  3. includes the derived bio_drug_stop_date in that same row when there is no info about the stop date, which is often the case. The derived stop date is based on the date of the next treatment (first case of bio_drug_series==2) minus one day. Derived date is only added if there is not already info about stop date.
  4. Some patients may have only one treatment, and if so, the stop date is kept empty if there is no prior info about stop date.

As you can see in the example, the patient may have several visits before a new treatment starts. In the 'want' table I have tried to illustrate that patient 1 has bio_drug_stop_date based on existing info from original table and patient 2 and 3 have bio_drug_stop_date derived from first case of bio_drug_series==2.

have:

df1 <- data.frame(
  patient_id = c(1, 1, 1, 2, 2, 2, 3, 3, 3, 3),
  bio_drug_series_number = c(1,1,2,1,2,3,1,1,2,3),
  bio_drug_start_date = c("12-01-2015", "12-01-2015", "20-05-2008", "01-09-2020", "03-09-2020", "04-02-2021", "05-07-2004", "05-07-2004", "17-09-2011", "24-05-2019"),  
  bio_drug_stop_date = c("18-05-2008", "18-05-2008", NA, NA, NA, NA, NA, NA, NA, NA),  
  var_x = c("c", "a", "f", "g", "n", "o", "p", "q", "t", "u"),
  var_y = c("d", "b", "e", "m", "l", "k", "h", "r", "s", "v")
)

want:

df222 <- data.frame(
  patient_id = c(1,2,3),
  bio_drug_series_number = c(1,1,1),
  bio_drug_start_date = c("12-01-2015", "01-09-2020", "05-07-2004"),  
  bio_drug_stop_date = c("18-05-2008", "02-09-2020", "16-09-2011"),  
  var_x = c("c", "g", "p"),
  var_y = c("d", "m", "h")
)

I would have thought I could find a similar case on StackOverflow, but wasn't able to.


Solution

  • It's a fairly specific solution you need so that's why you likely didn't find an existing answer. However, the principles behind the following answer can be found elsewhere on SO. It involves a stepwise approach with a temporary column, and the process could be refined further, but it works.

    One thing to note, your date columns are strings, so I converted them to Date format. This allows one day to be subtracted from the date where necessary. Based on your sample data, this solution assumes that if a date is present in the bio_drug_stop_date for a given patient, the first instance of a date is the date that you want.

    library(dplyr)
    
    # Sample data
    df1 <- data.frame(
      patient_id = c(1, 1, 1, 2, 2, 2, 3, 3, 3, 3),
      bio_drug_series_number = c(1,1,2,1,2,3,1,1,2,3),
      bio_drug_start_date = c("12-01-2015", "12-01-2015", "20-05-2008", "01-09-2020", "03-09-2020", "04-02-2021", "05-07-2004", "05-07-2004", "17-09-2011", "24-05-2019"),  
      bio_drug_stop_date = c("18-05-2008", "18-05-2008", NA, NA, NA, NA, NA, NA, NA, NA),  
      var_x = c("c", "a", "f", "g", "n", "o", "p", "q", "t", "u"),
      var_y = c("d", "b", "e", "m", "l", "k", "h", "r", "s", "v")
    )
    
    # Convert date columns to as.Date()
    df1$bio_drug_start_date <- as.Date(df1$bio_drug_start_date, format = "%d-%m-%Y")
    df1$bio_drug_stop_date <- as.Date(df1$bio_drug_stop_date, format = "%d-%m-%Y")
    
    # Assign missing date values to bio_drug_stop_date, subtract one day if necessary,
    # and return first row of each group
    df2 <- df1 |>
      group_by(patient_id) |>
      mutate(tmp = if_else(any(!is.na(bio_drug_stop_date)), 
                           first(na.omit(bio_drug_stop_date)), 
                           first(bio_drug_start_date[bio_drug_series_number == 2]) - 1),
             bio_drug_stop_date = if_else(is.na(bio_drug_stop_date), 
                                          tmp, bio_drug_stop_date)) |>
      select(-tmp) |>
      slice(1) |>
      ungroup()
    
    df2
    # # A tibble: 3 × 6
    #   patient_id bio_drug_series_number bio_drug_start_date bio_drug_stop_date var_x var_y
    #        <dbl>                  <dbl> <date>              <date>             <chr> <chr>
    # 1          1                      1 2015-01-12          2008-05-18         c     d    
    # 2          2                      1 2020-09-01          2020-09-02         g     m    
    # 3          3                      1 2004-07-05          2011-09-16         p     h