Search code examples
rstringset-difference

Comparing string differences between two columns in the same row in R, then printing any difference(s) between the two in a new column


I'm working on automating toxicology read-in and import for unique case IDs from flat .csv files in R/RStudio. I've been asked to generate a spreadsheet that compares the differences between the manual toxicology process and the automated process.

I have a data.frame/tibble with a casenumber (the unique ID) column, a column containing the substances created during the automated read-in (auto.tox in the example), and a column containing the current, "production" toxicology substances that were created manually by someone reading and physically typing in substances that returned a result (production.tox in the example).

I need to compare the "auto.tox" column and the "production.tox" columns below to assess for precise string differences between each column in each row (for each casenumber). I believe example data will better help illustrate what I mean.

df <- tribble(
  ~casenumber, ~auto.tox, ~production.tox,
  "2023-1",   c("Gabapentin, Alprazolam, Ethanol"), c("Alprazolam, Gabapentin, Ethanol"),
  "2023-2",   c("Alprazolam, Fentanyl, Ethanol"), c("Fentanyl, Ethanol"),
  "2023-3",   c("Fentanyl, Norfentanyl, 4-ANPP, Acetyl Fentanyl"), c("Fentanyl, Norfentanyl"))

I would like to add a new column called "different_substances," that contains any substance(s) present in the "auto.tox" column that are not present in the "production.tox" column for each casenumber.

I've tried:

  • using dplyr::anti_join() with separate data.frames with substances by case and joining on casenumber
  • using list(setdiff(auto.tox, production.tox))
  • mapply(vecsets::vsetdiff())

None of which gave me the precise output I'm looking for.

The desired output is:

casenumber  different_substances
"2023-1"      NA or "None" #(order doesn't matter)
"2023-2"     "Alprazolam"
"2023-3"     "4-ANPP, Acetyl Fentanyl"

I realize the last substance for 2023-3 should be spelled "acetylfentanyl" but our lookup table has a few misspellings, so the separate word is important, and it's important for "Fentanyl" alone to be recognized as different than "Acetyl Fentanyl".

I'm open to any general suggestions as well and also changing the format of the two substance columns to a more efficient format if needed.

If anyone can help, I'd greatly appreciate it. Thank you.


Solution

  • Here is one approach:

    library(tidyverse)
    # function to split the columns and remove whitespace
    
    f_split = function(x) {
        lapply(str_split(x, ","), str_trim)
    }
    
    ## function to find chemicals not in one set
    overlap = function(x, y) x[!x %in% y]
    
    # apply f_split
    auto_tox = f_split(df$auto.tox)
    production_tox = f_split(df$production.tox)
    
    
    df %>% mutate(diff = sapply(mapply(overlap, auto_tox, production_tox), paste0, collapse=", ")) %>%
        mutate(diff = if_else(diff == "", "None", diff))
    
    # A tibble: 3 × 5
      casenumber auto.tox
      <chr>      <chr>
    1 2023-1     Gabapentin, Alprazolam, Ethanol
    2 2023-2     Alprazolam, Fentanyl, Ethanol
    3 2023-3     Fentanyl, Norfentanyl, 4-ANPP, Acetyl Fentanyl
      production.tox                  difff
      <chr>                           <chr>
    1 Alprazolam, Gabapentin, Ethanol ""
    2 Fentanyl, Ethanol               "Alprazolam"
    3 Fentanyl, Norfentanyl           "4-ANPP,Acetyl Fentanyl"
      diff
      <chr>
    1 None
    2 Alprazolam
    3 4-ANPP,Acetyl Fentanyl