I have a dataset which has 100 diagnosis fields, stored as character. They are named DiagnosisCode1
through to DiagnosisCode100
. In most cases, the first diagnosis field is complete, but not always. Higher diagnosis fields are almost universally empty as NA
. I need to strip the first character off each of these fields and concatenate them in a single column which will be named OnsetFlags
. Because of the large number of columns I will be referring to, I would prefer not to explicitly reference all 100 fields, but refer to it either using a dplyr
helper - such as starts_with()
- or by using a range, like DiagnosisCode1:DiagnosisCode100
.
For the purposes of a concise demonstration, the below dataset uses 5 diagnosis columns:
example = structure(list(id = c("1", "2", "3", "4",
"5", "6", "7"), DiagnosisCode1 = c("2G56.2",
"2M48.06", "2G56.2", "2G56.0", "2S83.53", "2M23.20", "2S83.53"
), DiagnosisCode2 = c("2G56.0", "2G55.3", "2G56.0", "2G56.2",
"2Y92.82", "2Z86.43", "2S83.2"), DiagnosisCode3 = c("2Z86.43",
"2Z86.43", "2Z86.43", "1J98.1", "2V93.8", NA, "2W19"), DiagnosisCode4 = c(NA,
"2U82.3", NA, "2U80.2", "2U73.8", NA, "2Y92.39"), DiagnosisCode5 = c(NA,
"2U83.3", NA, NA, NA, NA, "2U56.39")), row.names = c(NA, -7L), class = "data.frame")
Which produces:
id DiagnosisCode1 DiagnosisCode2 DiagnosisCode3 DiagnosisCode4 DiagnosisCode5
1 1 2G56.2 2G56.0 2Z86.43 <NA> <NA>
2 2 2M48.06 2G55.3 2Z86.43 2U82.3 2U83.3
3 3 2G56.2 2G56.0 2Z86.43 <NA> <NA>
4 4 2G56.0 2G56.2 1J98.1 2U80.2 <NA>
5 5 2S83.53 2Y92.82 2V93.8 2U73.8 <NA>
6 6 2M23.20 2Z86.43 <NA> <NA> <NA>
7 7 2S83.53 2S83.2 2W19 2Y92.39 2U56.39
id DiagnosisCode1 DiagnosisCode2 DiagnosisCode3 DiagnosisCode4 DiagnosisCode5 OnsetFlags
1 1 G56.2 G56.0 Z86.43 <NA> <NA> 222
2 2 M48.06 G55.3 Z86.43 U82.3 U83.3 22222
3 3 G56.2 G56.0 Z86.43 <NA> <NA> 222
4 4 G56.0 G56.2 J98.1 U80.2 <NA> 2212
5 5 S83.53 Y92.82 V93.8 U73.8 <NA> 222
6 6 M23.20 Z86.43 <NA> <NA> <NA> 22
7 7 S83.53 S83.2 W19 Y92.39 U56.39 22222
So far, I have tried my usual bag of tricks with dplyr
across()
:
# create onset flag array
example = example %>%
mutate(OnsetFlags = across(starts_with("DiagnosisCode"), ~ as.character(substr(., 1, 1))
))
However, this creates a tibble column which itself contains multiple OnsetFlags columns equal to the number of DiagnosisCode columns. I'm having difficulty pulling these out of the tibble and concatenating into a dataframe column and, honestly, it feels like it's adding a lot of overhead for what I thought should be a simple process.
The closest other examples I've found on SO were:
...but I don't seem to be able to unite
the tibble columns.
Is there an easier way of doing this because I don't feel like I'm on the right path?
If this absolutely must be done in the wide format, I'd avoid apply
(which loops over each and every row and will bog down on large datasets), and nest to deal with the substr
/unite
processing in the standard vectorised R fashion. Something like this should work, though it could probably be improved.
example %>%
mutate(onset = example %>%
select(starts_with("DiagnosisCode")) %>%
map_df(substr,1,1) %>%
unite("onset", na.rm=TRUE, sep="")
%>% pull(onset)
)
The 'tidy' approach however would be to not work with wide 'untidy' data. Instead, pivot_longer
this into a long, 'tidy' diagnosis dataset.
Any summary information from your long diagnosis dataset can be joined on at the appropriate measurement level. This allows processing/analysis of grouped diagnoses and onset flags to be done in a very simple manner that doesn't get bogged down in variable prefix selections, apply/across logic, or needing to handle NA
values because of the unbalanced number of diagnoses per event (note values_drop_NA=TRUE
when pivoting).
diagnosis <- example %>%
pivot_longer(-id,
names_pattern=".+(\\d+)",
names_to="seq",
values_to="code",
values_drop_na=TRUE) %>%
mutate(onset = as.integer(substr(code, 1, 1)),
code = substr(code, 2, nchar(code)))
diagnosis
## A tibble: 26 × 4
# id seq code onset
# <chr> <chr> <chr> <int>
# 1 1 1 G56.2 2
# 2 1 2 G56.0 2
# 3 1 3 Z86.43 2
# 4 2 1 M48.06 2
# 5 2 2 G55.3 2
# 6 2 3 Z86.43 2
# 7 2 4 U82.3 2
# 8 2 5 U83.3 2
# 9 3 1 G56.2 2
#10 3 2 G56.0 2
## … with 16 more rows
Your current analysis task becomes a lot simpler:
diagnosis %>%
group_by(id) %>%
summarise(onset = paste(onset,collapse=""))
## A tibble: 7 × 2
# id onset
# <chr> <chr>
#1 1 222
#2 2 22222
#3 3 222
#4 4 2212
#5 5 2222
#6 6 22
#7 7 22222
Though I'd argue that combining n onset values violates most good practices for organising data where each cell should contain only one value, making future analysis more difficult as you'd have to then strsplit
the values to make any sense of them. Better to leave it in the long format and analyse them as they are.
Here's another example of how to deal with other more complicated analyses in the long form:
diagnosis %>%
group_by(id) %>%
summarise(onset1 = any(onset == 1),
G5and2 = any(substr(code,1,2) == 'G5' & onset == 2))
## A tibble: 7 × 3
# id onset1 G5and2
# <chr> <lgl> <lgl>
#1 1 FALSE TRUE
#2 2 FALSE TRUE
#3 3 FALSE TRUE
#4 4 TRUE TRUE
#5 5 FALSE FALSE
#6 6 FALSE FALSE
#7 7 FALSE FALSE