
R: Replace Abbreviations with Words


I have been trying to solve this problem all day without success.

I am trying to replace the following abbreviations with the desired words in my dataset:

  • Abbreviations: USA, H2O, Type 3, T3, bp

  • Desired words: United States of America, Water, Type 3 Disease, Type 3 Disease, blood pressure

The input data is, for example:

  • [1] I have type 3, its considered the highest severe stage of the disease.

  • [2] Drinking more H2O will make your skin glow.

  • [3] Do I have T2 or T3? Please someone help.

  • [4] We don't have this on the USA but I've heard that will be available in the next 3 years.

  • [5] Having a high bp means that I will have to look after my diet?

The desired output is:

  • [1] i have type 3 disease, its considered the highest severe stage of the disease.

  • [2] drinking more water will make your skin glow.

  • [3] do I have type 3 disease? please someone help.

  • [4] we don't have this in the united states of america but i've heard that will be available in the next 3 years.

  • [5] having a high blood pressure means that I will have to look after my diet?

I have tried the following code but without success:

   data= read.csv(C:"xxxxxxx, header= TRUE")
   lowercase= tolower(data$MESSAGE)
   dict=list("\\busa\\b"= "united states of america", "\\bh2o\\b"= 
   "water", "\\btype 3\\b|\\bt3\\"= "type 3 disease", "\\bbp\\b"= 
   "blood pressure")
   for(i in 1:length(dict1)){
   lowercasea= gsub(paste0("\\b", names(dict)[i], "\\b"), 
   dict[[i]], lowercase)}

I know that I am definitely doing something wrong. Could anyone guide me on this? Thank you in advance.


Solution

  • To replace only whole words (e.g. bp in Some bp. but not in bpcatalogue), build a regular expression from the abbreviations using word boundaries. Because some of your abbreviations span several words, also sort them by length in descending order; otherwise a shorter alternative (e.g. type) may trigger a replacement before a longer one that starts with it (e.g. type 3).

    Example code:

    library(stringr)

    abbreviations <- c("USA", "H2O", "Type 3", "T3", "bp")
    desired_words <- c("United States of America", "Water", "Type 3 Disease", "Type 3 Disease", "blood pressure")
    df <- data.frame(abbreviations, desired_words, stringsAsFactors = FALSE)
    x <- 'Abbreviations: USA, H2O, Type 3, T3, bp'

    # Longest abbreviations first, so a short alternative cannot pre-empt a longer one
    sort.by.length.desc <- function (v) v[order( -nchar(v)) ]

    # Whole-word alternation; every match is looked up in df.
    # match() is vectorised, so the lookup works whether the function
    # receives a single match or a vector of matches.
    str_replace_all(x, 
        paste0("\\b(",paste(sort.by.length.desc(abbreviations), collapse="|"), ")\\b"), 
        function(z) df$desired_words[match(z, df$abbreviations)]
    ) 
    

    The paste0("\\b(",paste(sort.by.length.desc(abbreviations), collapse="|"), ")\\b") call builds a regex like \b(Type 3|USA|H2O|T3|bp)\b. It matches Type 3, USA, etc. only as whole words, because \b is a word boundary. When a match is found, stringr::str_replace_all replaces it with the corresponding entry of desired_words. Note that the match is case-sensitive, so type 3 in lowercase is not replaced by this pattern.
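
    To illustrate why the ordering matters, here is a minimal base-R sketch. The keys type and type 3 and the helper build_pattern are hypothetical, chosen only so that one key is a prefix of the other. It assumes perl = TRUE, so that alternatives are tried left to right, which is also how the ICU engine behind stringr behaves.

    ```r
    # Hypothetical keys: "type" is a prefix of "type 3".
    keys <- c("type", "type 3")
    text <- "I have type 3."

    # Illustrative helper: wrap the alternatives in word boundaries.
    build_pattern <- function(v) paste0("\\b(", paste(v, collapse = "|"), ")\\b")

    unsorted <- build_pattern(keys)                        # \b(type|type 3)\b
    sorted   <- build_pattern(keys[order(-nchar(keys))])   # \b(type 3|type)\b

    # Extract the first match under each pattern.
    hit_unsorted <- regmatches(text, regexpr(unsorted, text, perl = TRUE))
    hit_sorted   <- regmatches(text, regexpr(sorted, text, perl = TRUE))

    print(hit_unsorted)  # "type": the shorter alternative wins
    print(hit_sorted)    # "type 3": the longer alternative is tried first
    ```

    With the unsorted pattern, a replacement would be applied to type alone and the trailing 3 would be left behind.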

    See the R demo online.
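
For the data in the question, the same idea can be sketched in base R without stringr. This is not the code from the answer above but an alternative under stated assumptions: the names lookup and replace_abbreviations are made up for illustration, the text is lowercased first (as the question does with tolower), and the keys are assumed to contain no regex metacharacters. It also sidesteps two problems in the loop from the question, which refers to an undefined dict1 and rebuilds lowercasea from the unchanged lowercase vector on every iteration, so at most the last replacement would survive.

```r
# Lowercase abbreviation -> replacement (hypothetical lookup table).
lookup <- c(
  "usa"    = "united states of america",
  "h2o"    = "water",
  "type 3" = "type 3 disease",
  "t3"     = "type 3 disease",
  "bp"     = "blood pressure"
)

replace_abbreviations <- function(text, lookup) {
  # Longest keys first, then one whole-word alternation.
  keys <- names(lookup)
  keys <- keys[order(-nchar(keys))]
  pattern <- paste0("\\b(", paste(keys, collapse = "|"), ")\\b")

  text <- tolower(text)
  hits <- gregexpr(pattern, text, perl = TRUE)

  # Replace every match in a single pass, so text that was just
  # inserted is never scanned again by another pattern.
  regmatches(text, hits) <- lapply(
    regmatches(text, hits),
    function(found) unname(lookup[found])
  )
  text
}

messages <- c(
  "I have type 3, its considered the highest severe stage of the disease.",
  "Drinking more H2O will make your skin glow.",
  "Do I have T2 or T3? Please someone help.",
  "We don't have this on the USA but I've heard that will be available in the next 3 years.",
  "Having a high bp means that I will have to look after my diet?"
)

result <- replace_abbreviations(messages, lookup)
print(result)
```

T2 is left alone because it is not in the lookup, and the 3 in next 3 years is untouched because only the whole key type 3 is matched. Keys that contain regex metacharacters would need to be escaped before being pasted into the pattern.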