Tags: r, apply, text-parsing

How do you apply a function to each cell of a column?


I am using the following function, which works as intended, to parse free-text patient medical records and extract the percentage stenosis of each artery.

library(dplyr)  # provides tibble(), %>% and pull(); tidytext must also be installed

txt <- "Small caliber RCA with 50% proximal and 70% mid stenoses."

coronary_anatomy <- function(x) {
    
    # Check if sentence
    if(!is.character(x)) {stop("Requires character string", call. = FALSE)}
    
    # Establish variables
    epicardial <- c("LM", "LAD", "LCX", "RCA")
    mods <- c("proximal", "mid", "distal", "ostial")
    
    sentence <-
        tibble(line = 1, sentence = x) %>%
        tidytext::unnest_tokens(input = sentence, output = word, to_lower = FALSE) %>%
        pull(word)
    
    # Identify number/locations of disease
    artery <- sentence[which(sentence %in% epicardial)]
    locs <- grep("\\d+", sentence) 
    mlocs <- which(sentence %in% mods)
    
    # Find the nearest neighbors to identify which modifier goes with which location
    space <- combn(mlocs, length(locs))
    dist <- apply(space, 2, function(x) {sum(abs(locs - x))})
    matched <- space[, which.min(dist)]
    
    tbl <- 
        tibble(
            anatomy = paste(sentence[matched], artery),
            stenosis = as.numeric(sentence[locs])
        )
    
    # Return
    return(tbl)
}

# Test it out

coronary_anatomy(txt)

Output:

# A tibble: 2 x 2
  anatomy      stenosis
  <chr>           <dbl>
1 proximal RCA       50
2 mid RCA            70

The code works great on a single string, but I am running into issues applying it on a larger scale. I want to apply it to a data frame with a whole column of patient medical records. A simplified version of that data frame is shown below.

# A tibble: 2 x 2
  PatientID Records
  <chr>     <chr>
1 1234      Small caliber RCA with 50% proximal and 70% mid stenoses
2 1235      Small caliber LCX with 40% proximal and 70% mid stenoses

So now comes the issue. I want to run this function over the entire Records column. However, the function (as shown above) returns a tibble whose number of rows varies depending on how much information there is to parse in each record.

Does anyone smarter than me have any idea how to run this function on each cell in a column of a data frame containing medical records, and collect the output in an organized manner, given that each call returns a tibble?


Solution

  • If speed isn't an issue, you can use lapply or a purrr::map function (or even a for loop) to go through each row of your data, saving each tibble result in a list, and then combine the list of tibbles into a nice big tibble to work with. E.g.,

    # dplyr and lapply
    result_list = lapply(your_data$Records, coronary_anatomy)
    names(result_list) = your_data$PatientID
    result_tbl = bind_rows(result_list, .id = "PatientID")
    result_tbl
    # # A tibble: 4 x 3
    #   PatientID anatomy      stenosis
    #   <chr>     <chr>           <dbl>
    # 1 1234      proximal RCA       50
    # 2 1234      mid RCA            70
    # 3 1235      proximal LCX       40
    # 4 1235      mid LCX            70
    

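    If you're using dplyr version 1.0 or higher, you can also do this simply with group_by and summarize, since summarize can then return more than one row per group. (From dplyr 1.1.0 on, returning multiple rows per group from summarize is deprecated, and reframe is the intended replacement.) The output below is from dplyr 1.0: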

    your_data %>% 
      group_by(PatientID) %>% 
      summarize(coronary_anatomy(Records))
    # `summarise()` regrouping output by 'PatientID' (override with `.groups` argument)
    # # A tibble: 4 x 3
    # # Groups:   PatientID [2]
    #   PatientID anatomy      stenosis
    #       <int> <chr>           <dbl>
    # 1      1234 proximal RCA       50
    # 2      1234 mid RCA            70
    # 3      1235 proximal LCX       40
    # 4      1235 mid LCX            70
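
The purrr::map route mentioned above might look like the sketch below. It is only an illustration under some assumptions: toy_parser is a hypothetical stand-in for coronary_anatomy (it just pulls out the percentages, so the example runs without tidytext), and the third record is made up to show what happens when a record cannot be parsed. Wrapping the parser in purrr::possibly turns an error on one record into NULL instead of aborting the whole run, which may matter on a large column of real records.

```r
library(dplyr)
library(purrr)

# Hypothetical stand-in for coronary_anatomy(): one row per percentage found,
# and an error when the record contains no percentage at all.
toy_parser <- function(x) {
  pct <- regmatches(x, gregexpr("\\d+(?=%)", x, perl = TRUE))[[1]]
  if (length(pct) == 0) stop("no percentage found", call. = FALSE)
  tibble(stenosis = as.numeric(pct))
}

# Example data; the third record is invented to trigger the error path.
your_data <- tibble(
  PatientID = c("1234", "1235", "1236"),
  Records = c(
    "Small caliber RCA with 50% proximal and 70% mid stenoses",
    "Small caliber LCX with 40% proximal stenosis",
    "No significant disease"
  )
)

# Return NULL instead of stopping when a single record fails to parse.
safe_parser <- possibly(toy_parser, otherwise = NULL)

result_tbl <- your_data$Records %>%
  set_names(your_data$PatientID) %>%  # name each element by patient
  map(safe_parser) %>%                # list of tibbles (or NULL on failure)
  compact() %>%                       # drop the failed records
  bind_rows(.id = "PatientID")        # stack, keeping the names as a column

result_tbl
```

Here patient 1234 contributes two rows, 1235 contributes one, and 1236 is dropped. To use the real function, swap toy_parser for coronary_anatomy; comparing the PatientID values in result_tbl against your_data shows which records failed to parse.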