Search code examples
rdatabasedata-cleaning

Splitting a column, pairing to another column


I have a dataset such as the one below, whereby each subject has an Inj-code, the date this occurred is given by a date in subsequent columns but provided by the first occurring code in the Inj_code variable. For instance, subject X has an inj_code variable of "F44|G44" with Date_0 providing the date of Inj_code F44, and Date_1 providing the date of Inj_code G44. A reproducible dataset is shown below:

ID <- c("1","2","3","4","5")
Inj_code <- c(
"M75.1", 
"C54.1|C55|D25.9", 
"D17.2|E78.0|G47.3|H02.3|H11.4", 
"K29.4|K29.6|K30|R11|R19.4", 
"O80.0|O99.0|Z37.0")

Date_0 <- c("2017-12-12",
"2010-08-04",
"2014-04-02",
"2021-12-03",
"2002-02-22")

Date_1 <- c("",
"2014-08-04",
"2015-04-02",
"2022-11-03",
"2004-04-22")

Date_2 <- c("",
"2015-08-04",
"2011-04-02",
"2002-12-03",
"2003-02-22")

Date_3 <- c("",
"",
"2005-06-05",
"2002-12-05",
"")

Date_4 <- c("",
"",
"2014-04-02",
"2021-12-03",
"")

ICD10 <- data.frame(ID, Inj_code, Date_0, Date_1, Date_2, Date_3, Date_4)

I am trying to do multiple steps to convert this into a format for analysis (to say which date each code occurred). The ideal output is something like |ID|Injury_code|Date in a long format. However a wide format (ID|Injury_code_0| Date_0|Injury_code_1|Date_1|Injury_code_2|Date_2 etc) would be good.

I have tried going:

  1. separation of the Inj_code by the separating character ("|")
  2. Pairing the Inj code and the date
  3. Making this into a long format

Apologies for the long post, this has been a difficult problem for me and I have tried a fair few ways!


Solution

  • You can use tidyr::separate_wider_delim() from the tidyverse to split the injury code column based on the | separator. This will split the injury codes into as many separate columns as required, naming them Inj_code_1, Inj_code_2, etc.

    library(tidyverse)
    
    out <- separate_wider_delim(data = ICD10, cols = Inj_code, delim = "|", names_sep = "_", too_few = "align_start")
    
    # A tibble: 5 × 11
      ID    Inj_code_1 Inj_code_2 Inj_code_3 Inj_code_4 Inj_code_5 Date_0     Date_1       Date_2       Date_3       Date_4    
      <chr> <chr>      <chr>      <chr>      <chr>      <chr>      <chr>      <chr>        <chr>        <chr>        <chr>     
    1 1     M75.1      NA         NA         NA         NA         2017-12-12 ""           ""           ""           ""        
    2 2     C54.1      C55        D25.9      NA         NA         2010-08-04 "2014-08-04" "2015-08-04" ""           ""        
    3 3     D17.2      E78.0      G47.3      H02.3      H11.4      2014-04-02 "2015-04-02" "2011-04-02" "2005-06-05" "2014-04-…
    4 4     K29.4      K29.6      K30        R11        R19.4      2021-12-03 "2022-11-03" "2002-12-03" "2002-12-05" "2021-12-…
    5 5     O80.0      O99.0      Z37.0      NA         NA         2002-02-22 "2004-04-22" "2003-02-22" "" 
    

    From that point there are various ways to get this wide data into a long format, but if you only have a relatively small number of columns then the following method is somewhat unsightly but straightforward:

    out_long <-
      # use bind_rows() to select ID plus matching pairs of injury code and date
      bind_rows(
        dplyr::select(out, ID, injury_code = Inj_code_1, date = Date_0),
        dplyr::select(out, ID, injury_code = Inj_code_2, date = Date_1),
        dplyr::select(out, ID, injury_code = Inj_code_3, date = Date_2),
        dplyr::select(out, ID, injury_code = Inj_code_4, date = Date_3),
        dplyr::select(out, ID, injury_code = Inj_code_5, date = Date_4)
      ) %>%
      # remove rows that have no injury code
      filter(
        !is.na(injury_code)
      ) %>%
      # sort by ID and injury code, or however you like
      arrange(
        ID, injury_code
      )
    
    # A tibble: 17 × 3
       ID    injury_code date      
       <chr> <chr>       <chr>     
     1 1     M75.1       2017-12-12
     2 2     C54.1       2010-08-04
     3 2     C55         2014-08-04
     4 2     D25.9       2015-08-04
     5 3     D17.2       2014-04-02
     6 3     E78.0       2015-04-02
     7 3     G47.3       2011-04-02
     8 3     H02.3       2005-06-05
     9 3     H11.4       2014-04-02
    10 4     K29.4       2021-12-03
    11 4     K29.6       2022-11-03
    12 4     K30         2002-12-03
    13 4     R11         2002-12-05
    14 4     R19.4       2021-12-03
    15 5     O80.0       2002-02-22
    16 5     O99.0       2004-04-22
    17 5     Z37.0       2003-02-22
    

    Update: an alternative using looping to account for any number of injury codes

    Here is a solution using looping with bind_rows() to build the final data.frame. This makes the assumption that there will always be a Date_X column to match each Inj_code_Y column, and that Y will always be one greater than X.

    First create an empty data.frame, and get the names of all the injury code columns:

    out_loop <-
      data.frame(
        ID = NULL,
        injury_code = NULL,
        date = NULL
      )
    
    injuryCodes <- names(out)[grepl("Inj_code", names(out))]
    

    For each injury code column, get the ID, injury code and corresponding date column, and append to data.frame:

    for (x in injuryCodes) {
      out_loop <-
        bind_rows(
          out_loop,
          out[, c("ID", x, paste0("Date_", as.numeric(str_extract(x, "[0-9]+$")) - 1))] %>% setNames(c("ID", "injury_code", "date"))
        )
    }
    
    out_loop <-
      filter(out_loop, !is.na(injury_code)) %>%
      arrange(ID, injury_code)
    
    > out_loop
       ID injury_code       date
    1   1       M75.1 2017-12-12
    2   2       C54.1 2010-08-04
    3   2         C55 2014-08-04
    4   2       D25.9 2015-08-04
    5   3       D17.2 2014-04-02
    6   3       E78.0 2015-04-02
    7   3       G47.3 2011-04-02
    8   3       H02.3 2005-06-05
    9   3       H11.4 2014-04-02
    10  4       K29.4 2021-12-03
    11  4       K29.6 2022-11-03
    12  4         K30 2002-12-03
    13  4         R11 2002-12-05
    14  4       R19.4 2021-12-03
    15  5       O80.0 2002-02-22
    16  5       O99.0 2004-04-22
    17  5       Z37.0 2003-02-22