I am using the following perfectly working function to parse through text data to find the percentage stenosis of arteries in patient medical records.
library(dplyr)  # provides tibble(), %>% and pull(); tidytext is called via ::
txt <- "Small caliber RCA with 50% proximal and 70% mid stenoses."
coronary_anatomy <- function(x) {
  # Check if sentence
  if (!is.character(x)) {
    stop("Requires character string", call. = FALSE)
  }

  # Establish variables
  epicardial <- c("LM", "LAD", "LCX", "RCA")
  mods <- c("proximal", "mid", "distal", "ostial")
  sentence <-
    tibble(line = 1, sentence = x) %>%
    tidytext::unnest_tokens(input = sentence, output = word, to_lower = FALSE) %>%
    pull(word)

  # Identify number/locations of disease
  artery <- sentence[which(sentence %in% epicardial)]
  locs <- grep("\\d+", sentence)
  mlocs <- which(sentence %in% mods)

  # Find the nearest neighbors to identify which modifier goes with which location
  space <- combn(mlocs, length(locs))
  dist <- apply(space, 2, function(x) {sum(abs(locs - x))})
  matched <- space[, which.min(dist)]
  tbl <-
    tibble(
      anatomy = paste(sentence[matched], artery),
      stenosis = as.numeric(sentence[locs])
    )

  # Return
  return(tbl)
}
# Test it out
coronary_anatomy(txt)
Output:
# A tibble: 2 x 2
anatomy stenosis
<chr> <dbl>
1 proximal RCA 50
2 mid RCA 70
The code works great. But now I am running into issues applying it on a larger scale. I want to apply this code to a data frame with a whole column of patient medical records. A simplified version of that data frame is shown below.
# A tibble: 2 x 2
PatientID Records
<chr> <chr>
1 1234 Small caliber RCA with 50% proximal and 70% mid stenoses
2 1235 Small caliber LCX with 40% proximal and 70% mid stenoses
So now comes the issue. I want to somehow run this function through the entire Records column. However, running this function (as shown above) outputs a tibble whose size varies depending on how much info is available to parse.
Does anyone smarter than me have any idea how to run this function through each cell in a column of a data table containing medical records, and output it in an organized manner, given that the output is a tibble?
If speed isn't an issue, you can use lapply or a purrr::map function (or even a for loop) to go through each row of your data, saving each tibble result in a list, and then combine the list of tibbles into a nice big tibble to work with. E.g.,
# dplyr and lapply
result_list = lapply(your_data$Records, coronary_anatomy)
names(result_list) = your_data$PatientID
result_tbl = bind_rows(result_list, .id = "PatientID")
result_tbl
# # A tibble: 4 x 3
# PatientID anatomy stenosis
# <chr> <chr> <dbl>
# 1 1234 proximal RCA 50
# 2 1234 mid RCA 70
# 3 1235 proximal LCX 40
# 4 1235 mid LCX 70
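The purrr::map route mentioned above looks much the same. On a large column some records will probably not parse, so the sketch below also wraps the parser in purrr::possibly so that one bad record does not abort the whole run. To keep the example self-contained it uses toy_parser, a hypothetical stand-in that only pulls out the percentages, and a made-up third record; with the real function you would pass coronary_anatomy instead and use a fallback tibble with the anatomy and stenosis columns.

```r
library(dplyr)
library(purrr)

# Hypothetical stand-in for coronary_anatomy(): one row per percentage found,
# and an error when the record contains no percentage at all.
toy_parser <- function(x) {
  pct <- regmatches(x, gregexpr("[0-9]+(?=%)", x, perl = TRUE))[[1]]
  if (length(pct) == 0) stop("No stenosis found", call. = FALSE)
  tibble(stenosis = as.numeric(pct))
}

# Illustrative data; the third record is deliberately unparseable
your_data <- tibble(
  PatientID = c("1234", "1235", "1236"),
  Records = c(
    "Small caliber RCA with 50% proximal and 70% mid stenoses",
    "Small caliber LCX with 40% proximal and 70% mid stenoses",
    "Normal coronary arteries"
  )
)

# possibly() replaces an error with a fallback value, here a zero-row tibble
safe_parser <- possibly(
  toy_parser,
  otherwise = tibble(stenosis = numeric())
)

result_tbl <- your_data$Records %>%
  set_names(your_data$PatientID) %>%  # list names become the .id column
  map(safe_parser) %>%
  bind_rows(.id = "PatientID")

result_tbl
```

Records that fail simply contribute no rows here. If you need to know which ones failed, compare the PatientID values in the result against the original column.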
If you're using dplyr version 1.0 or higher, you can also do this simply with group_by and summarize:
your_data %>%
group_by(PatientID) %>%
summarize(coronary_anatomy(Records))
# `summarise()` regrouping output by 'PatientID' (override with `.groups` argument)
# # A tibble: 4 x 3
# # Groups: PatientID [2]
# PatientID anatomy stenosis
# <int> <chr> <dbl>
# 1 1234 proximal RCA 50
# 2 1234 mid RCA 70
# 3 1235 proximal LCX 40
# 4 1235 mid LCX 70
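One caveat for newer installs: from dplyr 1.1.0, returning more than one row per group from summarize is deprecated and reframe is the intended replacement. Below is a minimal sketch of the same idea with reframe, again using a hypothetical toy_parser stand-in so the example runs on its own; it assumes one record per PatientID.

```r
library(dplyr)  # reframe() needs dplyr >= 1.1.0

# Hypothetical stand-in for coronary_anatomy(): one row per percentage found
toy_parser <- function(x) {
  pct <- regmatches(x, gregexpr("[0-9]+(?=%)", x, perl = TRUE))[[1]]
  tibble(stenosis = as.numeric(pct))
}

your_data <- tibble(
  PatientID = c("1234", "1235"),
  Records = c(
    "Small caliber RCA with 50% proximal and 70% mid stenoses",
    "Small caliber LCX with 40% proximal and 70% mid stenoses"
  )
)

# The unnamed tibble returned per group is unpacked into columns,
# and reframe() always returns an ungrouped result
reframed <- your_data %>%
  group_by(PatientID) %>%
  reframe(toy_parser(Records))

reframed
```

Swapping toy_parser for coronary_anatomy should give the same PatientID, anatomy, stenosis layout as the summarize call, without the regrouping message.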