Search code examples
rconditional-statementsreshapewide-format-data

Reshape a df from long to wide format using conditional statement


I have a long df that looks like this:

df <- data.frame(id=as.integer(c(123,123,123,124,124,124,125,125,126,126,126)),
                 date=as.Date(c("2014-03-12", "2015-05-02", "2015-09-16", "2015-10-24", "2016-12-11", "2017-10-17", "2017-08-06", "2018-01-29", "2015-09-16", "2015-11-12", "2015-12-03")),
                 event=as.character(c("A", "C", "E", "A", "B", "D", "A", "E", "A", "B", "E")),
                 order=as.integer(c(1,2,3,1,2,3,1,2,1,2,3)),
                 diff=as.integer(c(0,416,553,0,414,724,0,176,0,57,78)))

df
    id       date event order diff
1  123 2014-03-12     A     1    0
2  123 2015-05-02     C     2  416
3  123 2015-09-16     E     3  553
4  124 2015-10-24     A     1    0
5  124 2016-12-11     B     2  414
6  124 2017-10-17     D     3  724
7  125 2017-08-06     A     1    0
8  125 2018-01-29     E     2  176
9  126 2015-09-16     A     1    0
10 126 2015-11-12     B     2   57
11 126 2015-12-03     E     3   78

Each id will always have an initial event A and a final event, either D or E (mutually exclusive). Events B and C might or might not occur. diff is the difference in days between the date of each event and the date of the initial event A per id.

I want to obtain a wide df where each event will be a column (e.g. A_status) where 0= absent; 1= present. Similarly, each correspondent diff will be a column (e.g. A_time). But when event B or C is absent (e.g B= 0 or C= 0), I want their time to be filled with either D_time or E_time, whichever is present.

I need to create two columns based on the values of D and E:

  1. a column D.E_status, where 0=D; 1=E, and
  2. a column D.E_time that will receive whichever time is recorded (of D or E).

This is the desired output:

id  A_status A_time B_status B_time C_status C_time D.E_status  D.E_time
123 1        0      0        553    1        416    1           553
124 1        0      1        414    0        724    0           724
125 1        0      0        176    0        176    1           176
126 1        0      1        57     0        78     1           78

I really appreciate your help on this given my very basic R skills.


Solution

  • I'll demonstrate a dplyr/tidyr solution. Up front, I think your last value for B_time should be 57 not 78.

    library(dplyr)
    library(tidyr) # pivot_wider
    df %>%
      mutate(
        status = if_else(event == "D", 0, 1),
        event = if_else(event %in% c("D", "E"), "D.E", event)
      ) %>%
      pivot_wider(
        id_cols = "id",
        names_from = "event", values_from = c("diff", "status")
      ) %>%
      rename_with(.fn = ~ sub("diff", "time", sub("(.*)_(.*)", "\\2_\\1", .))) %>%
      mutate(
        across(c(C_time, B_time), ~ coalesce(., D.E_time)),
        across(c(C_status, B_status), ~ +(!is.na(.)))
      ) %>%
      select(order(colnames(.))) %>%
      relocate(id)
    # # A tibble: 4 × 9
    #      id A_status A_time B_status B_time C_status C_time D.E_status D.E_time
    #   <int>    <dbl>  <int>    <int>  <int>    <int>  <int>      <dbl>    <int>
    # 1   123        1      0        0    553        1    416          1      553
    # 2   124        1      0        1    414        0    724          0      724
    # 3   125        1      0        0    176        0    176          1      176
    # 4   126        1      0        1     57        0     78          1       78