Tags: r, regex, tidyr, stringr, tidytext

Finding Abbreviations in Data with R


In my data (which is text), there are abbreviations.

Are there any functions or code that search for abbreviations in text? For example, detecting 3-4-5 capital letter abbreviations and letting me count how often they occur.

Much appreciated!


Solution

  • detecting 3-4-5 capital letter abbreviations

    You may use

    \b[A-Z]{3,5}\b
    

    See the regex demo

    Details:

    • \b - a word boundary
    • [A-Z]{3,5} - 3, 4 or 5 uppercase ASCII letters (replace [A-Z] with [[:upper:]] to also match uppercase letters outside ASCII)
    • \b - a word boundary.

    R demo online (leveraging the regex occurrence count code from @TheComeOnMan)

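    # Pattern: a whole word made of 3 to 5 uppercase ASCII letters
    abbrev_regex <- "\\b[A-Z]{3,5}\\b"
    x <- "XYZ was seen at WXYZ with VWXYZ and did ABCDEFGH."
    # Count the matches (gregexpr returns -1 when there is no match)
    sum(gregexpr(abbrev_regex, x)[[1]] > 0)
    ## => [1] 3
    # Extract the matched abbreviations
    regmatches(x, gregexpr(abbrev_regex, x))[[1]]
    ## => [1] "XYZ"   "WXYZ"  "VWXYZ"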
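
The question also asks how often each abbreviation occurs. One way to get a per-abbreviation tally, rather than a single total, is to pool the matches from all texts and pass them to `table()`. This is a minimal base R sketch, not part of the original answer; the `texts` vector is made-up sample data and the names `matches` and `abbrev_counts` are placeholders.

```r
# Hypothetical sample data: a character vector with one text per element.
texts <- c(
  "XYZ was seen at WXYZ with VWXYZ and did ABCDEFGH.",
  "WXYZ met XYZ again, and XYZ left."
)

# Same pattern as above: a whole word of 3 to 5 uppercase ASCII letters.
abbrev_regex <- "\\b[A-Z]{3,5}\\b"

# Extract every match from every text, then flatten the list into one vector.
matches <- unlist(regmatches(texts, gregexpr(abbrev_regex, texts)))

# Tabulate occurrences per abbreviation, most frequent first.
abbrev_counts <- sort(table(matches), decreasing = TRUE)
print(abbrev_counts)
```

With this sample input, XYZ is counted three times, WXYZ twice and VWXYZ once; ABCDEFGH is skipped because it is longer than five letters. For a data frame, the same calls should work on a text column (for example `df$text`, a hypothetical name) in place of `texts`.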