r regex stringr

How to use a regular expression to match two words whose first letters are alphabetically adjacent?


I recently learned that parentheses () define a capture group in a regular expression, and that \\1 refers back to whatever the first group matched.

This is a powerful idea because I can use it to extract two consecutive words that start with the same letter. For example:

# Load the package:
library(stringr)

# Define a list of example sentences:
sentences <- c("She likes butter chicken.",
               "He loves mango.",
               "Lara makes nuggets.",
               "We want help.")

# Extract pairs of consecutive words that follow the pattern defined below:
str_extract_all(string = sentences, 
                pattern = regex(pattern = "\\b(\\w)\\w*\\s\\b\\1\\w*\\b",
                                ignore_case = TRUE))

Here is a breakdown of the regular expression \\b(\\w)\\w*\\s\\b\\1\\w*\\b I used:
\\b: word boundary where the first word starts
(\\w): a word character captured as the first group; here, the first letter of the first word
\\w*: zero or more word characters; here, the remaining letters of the first word
\\s: a whitespace character
\\b: word boundary where the second word starts
\\1: backreference to the first group; here, it forces the second word to start with the same letter
\\w*: zero or more word characters; here, the remaining letters of the second word
\\b: word boundary where the second word ends

This gives the expected result:

[[1]]
character(0)

[[2]]
character(0)

[[3]]
character(0)

[[4]]
[1] "We want"
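
For reference, the same backreference pattern can also be tried without stringr. This is only a sketch, assuming base R's gregexpr() with perl = TRUE and ignore.case = TRUE treats the pattern the same way as regex(ignore_case = TRUE) does here; the variable name same_letter is made up for illustration.

```r
# Example sentences, repeated so the snippet runs on its own
sentences <- c("She likes butter chicken.",
               "He loves mango.",
               "Lara makes nuggets.",
               "We want help.")

# Same pattern as above: capture the first letter, then require it again
# at the start of the next word via the backreference \\1
pattern <- "\\b(\\w)\\w*\\s\\b\\1\\w*\\b"

# Locate all matches (case-insensitively), then pull them out of the input
hits <- gregexpr(pattern, sentences, ignore.case = TRUE, perl = TRUE)
same_letter <- regmatches(sentences, hits)

same_letter[[4]]
```

Only the fourth sentence contains two consecutive words with the same first letter, so the other list elements come back empty.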

I am wondering how to modify the regular expression so that it extracts two consecutive words whose first letters are adjacent in the alphabet. For example, I'd expect it to match these words (marked between *):

c("She likes *butter chicken*.",
  "He *loves mango*.",
  "Lara *makes nuggets*.",
  "We want help.")

butter chicken, loves mango and makes nuggets should be matched because, in each pair, the first letters are adjacent^ (e.g. b comes right before c).

^: Assuming ascending order only (i.e. from A -> Z).

I hope my description is clear, and I'd appreciate any tips to achieve the desired results. Thank you.


Solution

  • 1) Create a function conseq that accepts a character vector x of length 2 and returns the two words pasted together if the first letter of x[2] immediately follows the first letter of x[1] in the alphabet, and NULL otherwise. Then apply it to every successive pair of words. We assume

    • case insensitivity. (See the code at the end of this approach if case sensitivity is wanted.)
    • all matches are wanted, including overlapping ones

    Code

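    library(zoo)
    
    # Return "word1 word2" if the first letter of x[2] directly follows the
    # first letter of x[1] in the alphabet (ignoring case); otherwise NULL.
    conseq <- function(x) {
      first <- substring(tolower(x), 1, 1)
      if (isTRUE(match(first[2], letters) ==
        match(first[1], letters) + 1)) paste(x[1], x[2])
    }
      
    # Split each sentence into words, then test every successive pair of words.
    sentences |>
      strsplit("[^a-zA-Z]+") |>
      lapply(rollapply, width = 2, conseq)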
    

    giving

    [[1]]
    [1] "butter chicken"
    
    [[2]]
    [1] "loves mango"
    
    [[3]]
    [1] "Lara makes"    "makes nuggets"
    
    [[4]]
    numeric(0)
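
If zoo is not available, the same pairwise comparison can be sketched in base R. This is not the code above, just an illustration of the idea under the same assumptions (case-insensitive, overlapping pairs kept); the helper name adjacent_pairs is hypothetical.

```r
# Example sentences, repeated so the snippet runs on its own
sentences <- c("She likes butter chicken.",
               "He loves mango.",
               "Lara makes nuggets.",
               "We want help.")

# Hypothetical helper: return every pair of consecutive words whose first
# letters are neighbours in the alphabet (case-insensitive)
adjacent_pairs <- function(sentence) {
  words <- strsplit(sentence, "[^a-zA-Z]+")[[1]]
  words <- words[nzchar(words)]          # drop empty pieces, if any
  if (length(words) < 2) return(character(0))
  # Position of each word's first letter in the alphabet (a = 1, ..., z = 26)
  pos <- match(substring(tolower(words), 1, 1), letters)
  # A difference of exactly 1 means the next word starts with the next letter
  keep <- which(diff(pos) == 1)
  paste(head(words, -1), tail(words, -1))[keep]
}

lapply(sentences, adjacent_pairs)
```

Unlike the rollapply version, this sketch returns character(0) rather than numeric(0) for a sentence with no matching pair.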
    

    If you want case sensitivity, replace conseq with

    conseq <- function(x) {
      first <- substring(x, 1, 1)
      # The "" entry separates the two alphabets so that "z" followed by "A"
      # is not treated as adjacent.
      lets <- c(letters, "", LETTERS)
      if (isTRUE(match(first[2], lets) ==
        match(first[1], lets) + 1)) paste(x[1], x[2])
    }
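
A few quick checks may make the case-sensitive behaviour easier to see. This sketch assumes the lookup vector is c(letters, "", LETTERS), i.e. both alphabets separated by an empty string; the name conseq_cs is hypothetical and only used to avoid overwriting conseq.

```r
# Hypothetical case-sensitive variant, assuming the lookup vector is
# lower-case letters, a "" separator, then upper-case letters
conseq_cs <- function(x) {
  first <- substring(x, 1, 1)
  lets <- c(letters, "", LETTERS)
  if (isTRUE(match(first[2], lets) ==
    match(first[1], lets) + 1)) paste(x[1], x[2])
}

conseq_cs(c("apple", "banana"))   # both lower case: a then b
conseq_cs(c("Lara", "Makes"))     # both upper case: L then M
conseq_cs(c("apple", "Banana"))   # mixed case: no match, returns NULL
conseq_cs(c("zebra", "Apple"))    # z then A: kept apart by the "" entry
```

The first two calls return the pasted pair; the last two return NULL because the first letters are not neighbours within the same case.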
    

    2) Here we assume that only lower-case letters are to be matched and that only the first match in each component should be returned. Components with no match yield NA. We generate the regular expression rx, with one alternative per pair of adjacent letters (a then b, b then c, and so on), and then use str_extract.

    Code

    library(stringr)
    
    rx <- sprintf("\\b%s[a-z]* +%s[a-z]*", letters[-26], letters[-1]) |>
             paste(collapse = "|")
    str_extract(sentences, rx)
    ## [1] "butter chicken" "loves mango"    "makes nuggets"  NA        
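
To see what the generated expression looks like, it can help to inspect the individual alternatives before they are pasted together. The sketch below also applies the pattern with base R instead of stringr, assuming gregexpr() with perl = TRUE handles \\b the same way here; the names alts and hits are made up for illustration.

```r
# Example sentences, repeated so the snippet runs on its own
sentences <- c("She likes butter chicken.",
               "He loves mango.",
               "Lara makes nuggets.",
               "We want help.")

# One alternative per pair of adjacent letters: a/b, b/c, ..., y/z
alts <- sprintf("\\b%s[a-z]* +%s[a-z]*", letters[-26], letters[-1])
length(alts)   # 25 alternatives
alts[1:2]      # the a/b and b/c alternatives

# Join the alternatives with "|" and extract all non-overlapping matches
rx <- paste(alts, collapse = "|")
hits <- regmatches(sentences, gregexpr(rx, sentences, perl = TRUE))
hits
```

Because only lower-case letters are matched, "Lara makes" is not picked up, and a sentence without a match yields character(0) here rather than NA.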
    

    Note

    Input taken from question

    sentences <- c("She likes butter chicken.",
                   "He loves mango.",
                   "Lara makes nuggets.",
                   "We want help.")
    

    Updates

    Added (2). Some improvements.