Suppose I want to exclude values matching a series of regular expressions from a character vector, in the same way that I would use setdiff() for fixed character strings, e.g.
value <- c("apple pie", "cat", "dog", "dogmatic", "no apples")
re_setdiff(value, c("^apple", "^dog"))
## desired results:
value[c(2,5)]
[1] "cat" "no apples"
I know how to code this by brute force (see my answer), but I am wondering whether there is a more efficient or more idiomatic way to do it (maybe something in stringi/stringr?), or something that is already in a widely used package.
You are right that it can be done with Reduce():
> value <- c("apple pie", "cat", "dog", "dogmatic", "no apples")
> exclude <- c("^apple", "^dog")
> Reduce(\(x, y) grep(y, x, value = TRUE, invert = TRUE), exclude, value)
[1] "cat" "no apples"
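The Reduce() call above can be wrapped in a function with the interface asked for in the question. This is only a minimal sketch: the name re_setdiff comes from the question, while the argument names and the pass-through of extra arguments to grep() are assumptions made for illustration.

```r
# Hypothetical wrapper; not an existing function in base R or any package.
re_setdiff <- function(x, patterns, ...) {
  # Fold over the patterns, dropping the matches of each one in turn.
  # Extra arguments (e.g. ignore.case, perl) are forwarded to grep().
  Reduce(
    function(acc, p) grep(p, acc, value = TRUE, invert = TRUE, ...),
    patterns,
    x
  )
}

value <- c("apple pie", "cat", "dog", "dogmatic", "no apples")
result <- re_setdiff(value, c("^apple", "^dog"))
# result is c("cat", "no apples")
```

Because the input vector is used as the initial value of the fold, an empty set of patterns returns the input unchanged. Unlike setdiff(), this does not remove duplicated values from the input.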
# Benchmark of the approaches proposed so far
library(stringr)
library(microbenchmark)

jofrhwld <- \(value, exclude) {
  str_subset(
    value,
    # concatenate the patterns into a single regex
    pattern = str_c(exclude, collapse = "|"),
    negate = TRUE
  )
}
tic <- \(value, exclude) {
  Reduce(\(x, y) grep(y, x, value = TRUE, invert = TRUE), exclude, value)
}
darrentsai <- \(value, exclude) {
  # one column of matches per pattern; keep the rows with no match at all
  value[!rowSums(sapply(exclude, grepl, value))]
}
benbolker <- function(x, y, ...) {
  # drop the matches of each pattern in turn
  for (yy in y) {
    x <- grep(yy, x, invert = TRUE, value = TRUE, ...)
  }
  return(x)
}
microbenchmark(
  jofrhwld = jofrhwld(value, exclude),
  tic = tic(value, exclude),
  darrentsai = darrentsai(value, exclude),
  benbolker = benbolker(value, exclude),
  unit = "relative",
  check = "equal"
)
shows
Unit: relative
       expr      min       lq     mean   median       uq      max neval
   jofrhwld 3.298173 3.365282 4.302031 3.267003 4.027422 11.35689   100
        tic 1.326923 1.334713 3.154308 1.366346 1.429391 14.58037   100
 darrentsai 2.548269 2.761838 3.840466 2.606786 2.723790 17.21892   100
  benbolker 1.000000 1.000000 1.000000 1.000000 1.000000  1.00000   100
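One caveat about the matrix-based variant (darrentsai): rowSums() needs a matrix, and sapply() returns a plain vector when value has length one, so that version errors on a single-element input. A possible workaround, sketched here under the assumption that a mask-based approach is still wanted, is to OR the per-pattern matches together directly. The name re_setdiff_any is hypothetical, and this variant was not part of the benchmark above.

```r
# Hypothetical variant of the mask-based approach that also handles
# inputs of length one or zero, and an empty set of patterns.
re_setdiff_any <- function(x, patterns) {
  # One logical vector per pattern, OR-ed together into a single mask.
  # The all-FALSE initial value covers the case of no patterns.
  hits <- Reduce(`|`, lapply(patterns, grepl, x), rep(FALSE, length(x)))
  x[!hits]
}

value <- c("apple pie", "cat", "dog", "dogmatic", "no apples")
exclude <- c("^apple", "^dog")
kept <- re_setdiff_any(value, exclude)
# kept is c("cat", "no apples")
single <- re_setdiff_any("cat", exclude)
# single is "cat"
```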