I have a dataset such as the one below, whereby each subject has an Inj-code, the date this occurred is given by a date in subsequent columns but provided by the first occurring code in the Inj_code variable. For instance, subject X has an inj_code variable of "F44|G44" with Date_0 providing the date of Inj_code F44, and Date_1 providing the date of Inj_code G44. A reproducible dataset is shown below:
ID <- c("1","2","3","4","5")
Inj_code <- c(
"M75.1",
"C54.1|C55|D25.9",
"D17.2|E78.0|G47.3|H02.3|H11.4",
"K29.4|K29.6|K30|R11|R19.4",
"O80.0|O99.0|Z37.0")
Date_0 <- c("2017-12-12",
"2010-08-04",
"2014-04-02",
"2021-12-03",
"2002-02-22")
Date_1 <- c("",
"2014-08-04",
"2015-04-02",
"2022-11-03",
"2004-04-22")
Date_2 <- c("",
"2015-08-04",
"2011-04-02",
"2002-12-03",
"2003-02-22")
Date_3 <- c("",
"",
"2005-06-05",
"2002-12-05",
"")
Date_4 <- c("",
"",
"2014-04-02",
"2021-12-03",
"")
ICD10 <- data.frame(ID, Inj_code, Date_0, Date_1, Date_2, Date_3, Date_4)
I am trying to do multiple steps to convert this into a format for analysis (to say which date each code occurred). The ideal output is something like |ID|Injury_code|Date in a long format. However a wide format (ID|Injury_code_0| Date_0|Injury_code_1|Date_1|Injury_code_2|Date_2 etc) would be good.
I have tried going:
Apologies for the long post, this has been a difficult problem for me and I have tried a fair few ways!
You can use tidyr::separate_wider_delim()
from the tidyverse to split the injury code column based on the | separator. This will split the injury codes into as many separate columns as required, naming them Inj_code_1
, Inj_code_2
, etc.
library(tidyverse)
out <- separate_wider_delim(data = ICD10, cols = Inj_code, delim = "|", names_sep = "_", too_few = "align_start")
# A tibble: 5 × 11
ID Inj_code_1 Inj_code_2 Inj_code_3 Inj_code_4 Inj_code_5 Date_0 Date_1 Date_2 Date_3 Date_4
<chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
1 1 M75.1 NA NA NA NA 2017-12-12 "" "" "" ""
2 2 C54.1 C55 D25.9 NA NA 2010-08-04 "2014-08-04" "2015-08-04" "" ""
3 3 D17.2 E78.0 G47.3 H02.3 H11.4 2014-04-02 "2015-04-02" "2011-04-02" "2005-06-05" "2014-04-…
4 4 K29.4 K29.6 K30 R11 R19.4 2021-12-03 "2022-11-03" "2002-12-03" "2002-12-05" "2021-12-…
5 5 O80.0 O99.0 Z37.0 NA NA 2002-02-22 "2004-04-22" "2003-02-22" ""
From that point there are various ways to get this wide data into a long format, but if you only have a relatively small number of columns then the following method is somewhat unsightly but straightforward:
out_long <-
# use bind_rows() to select ID plus matching pairs of injury code and date
bind_rows(
dplyr::select(out, ID, injury_code = Inj_code_1, date = Date_0),
dplyr::select(out, ID, injury_code = Inj_code_2, date = Date_1),
dplyr::select(out, ID, injury_code = Inj_code_3, date = Date_2),
dplyr::select(out, ID, injury_code = Inj_code_4, date = Date_3),
dplyr::select(out, ID, injury_code = Inj_code_5, date = Date_4)
) %>%
# remove rows that have no injury code
filter(
!is.na(injury_code)
) %>%
# sort by ID and injury code, or however you like
arrange(
ID, injury_code
)
# A tibble: 17 × 3
ID injury_code date
<chr> <chr> <chr>
1 1 M75.1 2017-12-12
2 2 C54.1 2010-08-04
3 2 C55 2014-08-04
4 2 D25.9 2015-08-04
5 3 D17.2 2014-04-02
6 3 E78.0 2015-04-02
7 3 G47.3 2011-04-02
8 3 H02.3 2005-06-05
9 3 H11.4 2014-04-02
10 4 K29.4 2021-12-03
11 4 K29.6 2022-11-03
12 4 K30 2002-12-03
13 4 R11 2002-12-05
14 4 R19.4 2021-12-03
15 5 O80.0 2002-02-22
16 5 O99.0 2004-04-22
17 5 Z37.0 2003-02-22
Update: an alternative using looping to account for any number of injury codes
Here is a solution using looping with bind_rows()
to build the final data.frame. This makes the assumption that there will always be a Date_X
column to match each Inj_code_Y
column, and that Y will always be one greater than X.
First create an empty data.frame, and get the names of all the injury code columns:
out_loop <-
data.frame(
ID = NULL,
injury_code = NULL,
date = NULL
)
injuryCodes <- names(out)[grepl("Inj_code", names(out))]
For each injury code column, get the ID, injury code and corresponding date column, and append to data.frame:
for (x in injuryCodes) {
out_loop <-
bind_rows(
out_loop,
out[, c("ID", x, paste0("Date_", as.numeric(str_extract(x, "[0-9]+$")) - 1))] %>% setNames(c("ID", "injury_code", "date"))
)
}
out_loop <-
filter(out_loop, !is.na(injury_code)) %>%
arrange(ID, injury_code)
> out_loop
ID injury_code date
1 1 M75.1 2017-12-12
2 2 C54.1 2010-08-04
3 2 C55 2014-08-04
4 2 D25.9 2015-08-04
5 3 D17.2 2014-04-02
6 3 E78.0 2015-04-02
7 3 G47.3 2011-04-02
8 3 H02.3 2005-06-05
9 3 H11.4 2014-04-02
10 4 K29.4 2021-12-03
11 4 K29.6 2022-11-03
12 4 K30 2002-12-03
13 4 R11 2002-12-05
14 4 R19.4 2021-12-03
15 5 O80.0 2002-02-22
16 5 O99.0 2004-04-22
17 5 Z37.0 2003-02-22