Splitting a column, pairing to another column

I have a dataset such as the one below, whereby each subject has an Inj-code, the date this occurred is given by a date in subsequent columns but provided by the first occurring code in the Inj_code variable. For instance, subject X has an inj_code variable of "F44|G44" with Date_0 providing the date of Inj_code F44, and Date_1 providing the date of Inj_code G44. A reproducible dataset is shown below:

ID <- c("1","2","3","4","5")
Inj_code <- c(
"M75.1", 
"C54.1|C55|D25.9", 
"D17.2|E78.0|G47.3|H02.3|H11.4", 
"K29.4|K29.6|K30|R11|R19.4", 
"O80.0|O99.0|Z37.0")

Date_0 <- c("2017-12-12",
"2010-08-04",
"2014-04-02",
"2021-12-03",
"2002-02-22")

Date_1 <- c("",
"2014-08-04",
"2015-04-02",
"2022-11-03",
"2004-04-22")

Date_2 <- c("",
"2015-08-04",
"2011-04-02",
"2002-12-03",
"2003-02-22")

Date_3 <- c("",
"",
"2005-06-05",
"2002-12-05",
"")

Date_4 <- c("",
"",
"2014-04-02",
"2021-12-03",
"")

ICD10 <- data.frame(ID, Inj_code, Date_0, Date_1, Date_2, Date_3, Date_4)

I have tried going:

separation of the Inj_code by the separating character ("|")
Pairing the Inj code and the date
Making this into a long format

Apologies for the long post, this has been a difficult problem for me and I have tried a fair few ways!

Solution

You can use tidyr::separate_wider_delim() from the tidyverse to split the injury code column based on the | separator. This will split the injury codes into as many separate columns as required, naming them Inj_code_1, Inj_code_2, etc.

library(tidyverse)

out <- separate_wider_delim(data = ICD10, cols = Inj_code, delim = "|", names_sep = "_", too_few = "align_start")

# A tibble: 5 × 11
  ID    Inj_code_1 Inj_code_2 Inj_code_3 Inj_code_4 Inj_code_5 Date_0     Date_1       Date_2       Date_3       Date_4    
  <chr> <chr>      <chr>      <chr>      <chr>      <chr>      <chr>      <chr>        <chr>        <chr>        <chr>     
1 1     M75.1      NA         NA         NA         NA         2017-12-12 ""           ""           ""           ""        
2 2     C54.1      C55        D25.9      NA         NA         2010-08-04 "2014-08-04" "2015-08-04" ""           ""        
3 3     D17.2      E78.0      G47.3      H02.3      H11.4      2014-04-02 "2015-04-02" "2011-04-02" "2005-06-05" "2014-04-…
4 4     K29.4      K29.6      K30        R11        R19.4      2021-12-03 "2022-11-03" "2002-12-03" "2002-12-05" "2021-12-…
5 5     O80.0      O99.0      Z37.0      NA         NA         2002-02-22 "2004-04-22" "2003-02-22" ""

From that point there are various ways to get this wide data into a long format, but if you only have a relatively small number of columns then the following method is somewhat unsightly but straightforward:

out_long <-
  # use bind_rows() to select ID plus matching pairs of injury code and date
  bind_rows(
    dplyr::select(out, ID, injury_code = Inj_code_1, date = Date_0),
    dplyr::select(out, ID, injury_code = Inj_code_2, date = Date_1),
    dplyr::select(out, ID, injury_code = Inj_code_3, date = Date_2),
    dplyr::select(out, ID, injury_code = Inj_code_4, date = Date_3),
    dplyr::select(out, ID, injury_code = Inj_code_5, date = Date_4)
  ) %>%
  # remove rows that have no injury code
  filter(
    !is.na(injury_code)
  ) %>%
  # sort by ID and injury code, or however you like
  arrange(
    ID, injury_code
  )

# A tibble: 17 × 3
   ID    injury_code date      
   <chr> <chr>       <chr>     
 1 1     M75.1       2017-12-12
 2 2     C54.1       2010-08-04
 3 2     C55         2014-08-04
 4 2     D25.9       2015-08-04
 5 3     D17.2       2014-04-02
 6 3     E78.0       2015-04-02
 7 3     G47.3       2011-04-02
 8 3     H02.3       2005-06-05
 9 3     H11.4       2014-04-02
10 4     K29.4       2021-12-03
11 4     K29.6       2022-11-03
12 4     K30         2002-12-03
13 4     R11         2002-12-05
14 4     R19.4       2021-12-03
15 5     O80.0       2002-02-22
16 5     O99.0       2004-04-22
17 5     Z37.0       2003-02-22

Update: an alternative using looping to account for any number of injury codes

Here is a solution using looping with bind_rows() to build the final data.frame. This makes the assumption that there will always be a Date_X column to match each Inj_code_Y column, and that Y will always be one greater than X.

First create an empty data.frame, and get the names of all the injury code columns:

out_loop <-
  data.frame(
    ID = NULL,
    injury_code = NULL,
    date = NULL
  )

injuryCodes <- names(out)[grepl("Inj_code", names(out))]

For each injury code column, get the ID, injury code and corresponding date column, and append to data.frame:

for (x in injuryCodes) {
  out_loop <-
    bind_rows(
      out_loop,
      out[, c("ID", x, paste0("Date_", as.numeric(str_extract(x, "[0-9]+$")) - 1))] %>% setNames(c("ID", "injury_code", "date"))
    )
}

out_loop <-
  filter(out_loop, !is.na(injury_code)) %>%
  arrange(ID, injury_code)

> out_loop
   ID injury_code       date
1   1       M75.1 2017-12-12
2   2       C54.1 2010-08-04
3   2         C55 2014-08-04
4   2       D25.9 2015-08-04
5   3       D17.2 2014-04-02
6   3       E78.0 2015-04-02
7   3       G47.3 2011-04-02
8   3       H02.3 2005-06-05
9   3       H11.4 2014-04-02
10  4       K29.4 2021-12-03
11  4       K29.6 2022-11-03
12  4         K30 2002-12-03
13  4         R11 2002-12-05
14  4       R19.4 2021-12-03
15  5       O80.0 2002-02-22
16  5       O99.0 2004-04-22
17  5       Z37.0 2003-02-22