r · regex · text-mining · topic-modeling

How to clean abbreviations containing a "period-punctuation" ("e.g.", "st.", "rd.") but leave the "." at the end of a sentence?


I am working on a sentence-level LDA in R and am currently trying to split my text data into individual sentences with the sent_detect() function from the openNLP package.

However, my text data contains a lot of abbreviations that have a "period symbol" but do not mark the end of a sentence. Here are some examples: "st. patricks day", "oxford st.", "blue rd.", "e.g."

Is there a way to use gsub() to account for such short abbreviations and remove their "." so that it is not wrongly detected as a sentence boundary by sent_detect()? Unfortunately, these abbreviations do not always sit between two words; sometimes they really do mark the end of a sentence:

Example:

"I really liked Oxford st." - here "st." marks the end of the sentence, so the "." should remain.

vs

"Oxford st. was very busy." - here "st." is not at the end of the sentence, so the "." should be removed.

I am not sure whether there is a solution for this, but maybe someone else who is more familiar with sentence-level analysis knows a way of how to deal with such issues. Thank you!
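As a side note, the rule described in the question (keep the "." only when nothing follows the abbreviation) can be approximated in base R with a Perl-style lookahead: replace "st." only when another word comes after it. A minimal sketch, assuming the abbreviation is lowercase and the sample strings are illustrative:

```r
x <- c("I really liked Oxford st.", "Oxford st. was very busy.")

# Drop the period of "st." only when a following word shows the sentence
# continues; "st." at the very end of a string is left untouched.
gsub("\\bst\\.(?=\\s+\\w)", "st", x, perl = TRUE)
# → "I really liked Oxford st." "Oxford st was very busy."
```

This does not disambiguate street vs. saint, which is why a translation table, as in the solution below, is more robust.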


Solution

  • Looking at your previously asked questions, I would suggest looking into the textclean package. A lot of what you want is already included there, and anything missing can be adapted or extended.

    Just replacing "st." with something will lead to problems, as it could mean street or saint, but "st. patricks day" is easy to find. The problem you will have is making a list of possible occurrences and finding alternatives for them. The easiest tools to use are translation tables. Below I create a table for a few abbreviations and their expected long forms. It is then up to you (or your client) to specify what you want as an end result. The best way is to create the table in Excel or a database and load it into a data.frame (stored somewhere for easy access). Depending on your text this might be a lot of work, but it will improve the quality of your outcome.

    Example:

    library(textclean)
    
    text <- c("I really liked Oxford st.", "Oxford st. was very busy.",
              "e.g. st. Patricks day was on oxford st. and blue rd.")
    
    
    # Create an abbreviations table. Note the escaped dot in "rd\\." so we
    # match "rd." and not just "rd"; also decide whether it should become
    # "road" or could mean something else.
    
    abbreviations <- data.frame(abbreviation = c("st. patricks day", "oxford st.", "rd\\.", "e.g."),
                                replacement = c("saint patricks day","oxford street","road", "eg"))
    
    
    # I use replace_contraction() because it lets you supply your own table
    # in place of the default contraction table.
    
    text <- replace_contraction(text, abbreviations)
    
    text
    [1] "I really liked oxford street"                             "oxford street was very busy."                            
    [3] "eg saint patricks day was on oxford street and blue road"
    
    # The results above are missing some end marks, so we use the following
    # function to add them back.
    
    text <- add_missing_endmark(text, ".")
    
    text
    [1] "I really liked oxford street."                             "oxford street was very busy."                             
    [3] "eg saint patricks day was on oxford street and blue road."
    

    textclean has a range of replace_zzz functions, most of which are built on the package's mgsub() function. Check the documentation of all the functions to get an idea of what they do.
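    Most of those replace_ functions ultimately call mgsub(), which applies several pattern/replacement pairs in one pass. A minimal sketch, assuming literal matching (fixed = TRUE, the default) and illustrative sample strings:

```r
library(textclean)

# mgsub() works through all pattern/replacement pairs; with the default
# fixed = TRUE the patterns are treated as literal strings, and longer
# patterns are applied first so "oxford st." takes precedence over "rd.".
mgsub(c("oxford st. was very busy.", "turn onto blue rd. here."),
      pattern     = c("oxford st.", "rd."),
      replacement = c("oxford street", "road"))
```

    Because mgsub() is shared by the replace_ family, a custom table like the abbreviations data.frame above can be fed to it directly as two vectors.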