I recently learned that () can be used to define a capture group in a regular expression, and that \\1 refers back to the text matched by the first () group.
This is a powerful idea because I can use it to extract two consecutive words that start with the same letter. For example:
# Load the package:
library(stringr)
# Define a list of example sentences:
sentences <- c("She likes butter chicken.",
"He loves mango.",
"Lara makes nuggets.",
"We want help.")
# Extract each pair of words that matches the pattern defined below:
str_extract_all(string = sentences,
                pattern = regex(pattern = "\\b(\\w)\\w*\\s\\b\\1\\w*\\b",
                                ignore_case = TRUE))
Here is a breakdown of the regular expression \\b(\\w)\\w*\\s\\b\\1\\w*\\b that I used:
\\b : word boundary where the first word starts
(\\w) : a word character captured as the first group; here, the first letter of the first word
\\w* : zero or more word characters; here, the remaining letters of the first word
\\s : a whitespace character separating the two words
\\b : word boundary where the second word starts
\\1 : a backreference to the first group; here, the first letter of the second word
\\w* : zero or more word characters; here, the remaining letters of the second word
\\b : word boundary where the second word ends
This gives the expected result:
[[1]]
character(0)
[[2]]
character(0)
[[3]]
character(0)
[[4]]
[1] "We want"
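To see what the capture group actually holds, here is a minimal sketch in base R (so it does not depend on stringr). It assumes a version of R in which regexec() accepts perl = TRUE; the variable names pattern and m are only for illustration.

```r
sentences <- c("She likes butter chicken.",
               "He loves mango.",
               "Lara makes nuggets.",
               "We want help.")

# Same pattern as above, matched with PCRE and ignoring case.
pattern <- "\\b(\\w)\\w*\\s\\b\\1\\w*\\b"

# regexec() reports the whole match followed by each capture group;
# regmatches() turns those positions into the matched strings.
m <- regmatches(sentences,
                regexec(pattern, sentences,
                        perl = TRUE, ignore.case = TRUE))

m[[4]]  # whole match first, then the captured first letter
m[[1]]  # empty: no pair of words shares a first letter
```

The second element of m[[4]] is the text that \\1 has to repeat at the start of the second word.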
How can I modify the regular expression so that it extracts two consecutive words whose first letters are adjacent in the alphabet? For example, I'd expect these words (marked between *) to be matched:
c("She likes *butter chicken*.",
"He *loves mango*.",
"Lara *makes nuggets*.",
"We want help.")
butter chicken, loves mango and makes nuggets should be matched because, in each pair, the first letters are adjacent^ (e.g. b comes immediately before c).
^: Assuming ascending order only (i.e. from A -> Z).
I hope my description is clear, and I'd appreciate any tips to achieve the desired results. Thank you.
1) Create a function conseq which accepts a character vector of length 2, x, and returns the two words pasted together if the first letter of x[2] immediately follows the first letter of x[1] in the alphabet, and returns NULL otherwise. Then apply it to every successive pair of words. This version ignores case when comparing the first letters.
Code
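library(zoo)
# Return "x[1] x[2]" if the first letter of x[2] immediately follows the
# first letter of x[1] in the alphabet (ignoring case); otherwise NULL.
conseq <- function(x) {
  first <- substring(tolower(x), 1, 1)
  if (isTRUE(match(first[2], letters) ==
             match(first[1], letters) + 1)) paste(x[1], x[2])
}
# Split each sentence into words, then test every successive pair of words.
sentences |>
  strsplit("[^a-zA-Z]+") |>
  lapply(rollapply, width = 2, conseq)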
giving
[[1]]
[1] "butter chicken"
[[2]]
[1] "loves mango"
[[3]]
[1] "Lara makes" "makes nuggets"
[[4]]
numeric(0)
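If you would rather not depend on zoo, the same idea can be sketched in base R by comparing alphabet positions with diff(). This is only a sketch under the same case-insensitive assumption; adjacent_pairs is a hypothetical helper name, and it returns character(0) rather than numeric(0) when nothing matches.

```r
sentences <- c("She likes butter chicken.",
               "He loves mango.",
               "Lara makes nuggets.",
               "We want help.")

# Hypothetical helper: all successive word pairs in one sentence whose
# first letters are adjacent in the alphabet (ignoring case).
adjacent_pairs <- function(sentence) {
  words <- strsplit(sentence, "[^a-zA-Z]+")[[1]]
  words <- words[nzchar(words)]        # drop an empty leading piece, if any
  if (length(words) < 2) return(character(0))
  # Position of each word's first letter in the alphabet (a = 1, ..., z = 26).
  pos <- match(substring(tolower(words), 1, 1), letters)
  # A difference of exactly 1 means the next word starts with the next letter.
  hit <- which(diff(pos) == 1)
  paste(words[hit], words[hit + 1])
}

res <- lapply(sentences, adjacent_pairs)
res
```

Because every successive pair is tested separately, overlapping pairs such as "Lara makes" and "makes nuggets" are both reported.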
If you want case-sensitive matching, so that both words must start with letters of the same case, replace conseq with
conseq <- function(x) {
  first <- substring(x, 1, 1)
  # The "" separator keeps "z" followed by "A" from counting as adjacent.
  lets <- c(letters, "", LETTERS)
  if (isTRUE(match(first[2], lets) ==
             match(first[1], lets) + 1)) paste(x[1], x[2])
}
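As a quick check of how the lookup vector behaves, the sketch below isolates the comparison used in the case-sensitive conseq. The helper name is_next is hypothetical and exists only for this illustration.

```r
# Lower-case letters, an empty separator, then upper-case letters.
lets <- c(letters, "", LETTERS)

# Hypothetical helper: TRUE if b sits immediately after a in lets.
is_next <- function(a, b) {
  isTRUE(match(b, lets) == match(a, lets) + 1)
}

is_next("b", "c")  # TRUE:  adjacent, both lower case
is_next("B", "C")  # TRUE:  adjacent, both upper case
is_next("b", "C")  # FALSE: the cases differ
is_next("z", "A")  # FALSE: the "" separator sits between them
```

Without the "" entry, "z" and "A" would occupy positions 26 and 27 and would wrongly count as adjacent.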
2) Here we assume that only lower-case letters are to be matched and that only the first match in each component should be returned. A component with no match yields NA. We generate the regular expression, rx, as an alternation with one branch per pair of adjacent letters (a then b, b then c, ..., y then z), and then use str_extract.
Code
library(stringr)
rx <- sprintf("\\b%s[a-z]* +%s[a-z]*", letters[-26], letters[-1]) |>
paste(collapse = "|")
str_extract(sentences, rx)
## [1] "butter chicken" "loves mango" "makes nuggets" NA
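To make the generated pattern less opaque, here is a small sketch in base R that builds the same alternation step by step and inspects it. The intermediate name alts is an assumption for illustration; grepl() with perl = TRUE is used only to check which sentences contain a match.

```r
sentences <- c("She likes butter chicken.",
               "He loves mango.",
               "Lara makes nuggets.",
               "We want help.")

# One branch per pair of adjacent lower-case letters: a-b, b-c, ..., y-z.
alts <- sprintf("\\b%s[a-z]* +%s[a-z]*", letters[-26], letters[-1])
length(alts)                 # 25 branches
cat(alts[1:2], sep = "\n")   # print the first two branches literally

# Join the branches into a single alternation, as in the answer above.
rx <- paste(alts, collapse = "|")

# Which sentences contain at least one match?
has_match <- grepl(rx, sentences, perl = TRUE)
has_match
```

Each branch is anchored with \\b, so the first letter must start a word. That is why "Lara makes" is not picked up here: the branches list only lower-case letters.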
Input taken from question
sentences <- c("She likes butter chicken.",
"He loves mango.",
"Lara makes nuggets.",
"We want help.")
Added (2). Some improvements.