Tags: r, regex, string, set-difference

analog of setdiff() using regular expressions


Suppose I want to exclude values matching a series of regular expressions from a character vector, in the same way that I would use setdiff() for fixed character strings, e.g.

    value <- c("apple pie", "cat", "dog", "dogmatic", "no apples")
    re_setdiff(value, c("^apple", "^dog"))
    ## desired results:
    value[c(2, 5)]
    ## [1] "cat"       "no apples"

I know how I can code this by brute force (see my answer) but am wondering if there's a more efficient/more idiomatic way to do it (maybe something in stringi/stringr?), or something that's already in a (widely used) package?


Solution

  • You are right that this can be done with Reduce(): each pattern in turn filters the result left by the previous one.

    > value <- c("apple pie", "cat", "dog", "dogmatic", "no apples")
    
    > exclude <- c("^apple", "^dog")
    
    > Reduce(\(x, y) grep(y, x, value = TRUE, invert = TRUE), exclude, value)
    [1] "cat"       "no apples"
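    
    To make the fold easier to follow, here is a small sketch that keeps the intermediate results. The helper name drop_matches is only for illustration; it is the same grep() call as above, written with function() so that it also runs on R versions that lack the \(x) shorthand.

    ```r
    value <- c("apple pie", "cat", "dog", "dogmatic", "no apples")
    exclude <- c("^apple", "^dog")

    # Hypothetical helper: drop every element of x that matches one pattern.
    drop_matches <- function(x, pattern) {
        grep(pattern, x, value = TRUE, invert = TRUE)
    }

    # accumulate = TRUE returns a list of all intermediate vectors:
    # element 1 is the input, element k + 1 is what is left after k patterns.
    steps <- Reduce(drop_matches, exclude, value, accumulate = TRUE)
    print(steps)
    ```

    The last element of steps is the same vector that the one-line Reduce() call returns.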
    

    Benchmarking

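    # Packages used below; the \(x) lambda shorthand needs R >= 4.1.
    library(stringr)
    library(microbenchmark)
    
    # One combined regex with a negated match (stringr).
    jofrhwld <- \(value, exclude) {
        str_subset(
            value,
            # concat into 1 regex
            pattern = str_c(exclude, collapse = "|"),
            negate = TRUE
        )
    }
    
    # Fold over the patterns with Reduce().
    tic <- \(value, exclude) {
        Reduce(\(x, y) grep(y, x, value = TRUE, invert = TRUE), exclude, value)
    }
    
    # Logical matrix of matches (one column per pattern); keep rows with no match.
    darrentsai <- \(value, exclude) {
        value[!rowSums(sapply(exclude, grepl, value))]
    }
    
    # Plain for loop; extra arguments in ... are passed on to grep().
    benbolker <- function(x, y, ...) {
        for (yy in y) {
            x <- grep(yy, x, invert = TRUE, value = TRUE, ...)
        }
        return(x)
    }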
    
    microbenchmark(
        jofrhwld = jofrhwld(value, exclude),
        tic = tic(value, exclude),
        darrentsai = darrentsai(value, exclude),
        benbolker = benbolker(value, exclude),
        unit = "relative",
        check = "equal"
    )
    

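    shows that, on this five-element input, the plain for loop is fastest; at the median, Reduce is about 1.4 times slower, the rowSums approach about 2.6 times, and the single combined stringr pattern about 3.3 times (relative units, lower is faster):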

    Unit: relative
           expr      min       lq     mean   median       uq      max neval
       jofrhwld 3.298173 3.365282 4.302031 3.267003 4.027422 11.35689   100
            tic 1.326923 1.334713 3.154308 1.366346 1.429391 14.58037   100
     darrentsai 2.548269 2.761838 3.840466 2.606786 2.723790 17.21892   100
      benbolker 1.000000 1.000000 1.000000 1.000000 1.000000  1.00000   100
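
    These timings are for a five-element vector, so they may not carry over to larger inputs. As a rough sketch for checking that, the block below builds a bigger synthetic vector (the repetition count is an arbitrary assumption) and confirms in base R that the loop and a single "|"-joined pattern give the same result. Note that joining patterns with "|" is only safe when they contain no backreferences, since group numbering is shared across the alternatives.

    ```r
    value <- c("apple pie", "cat", "dog", "dogmatic", "no apples")
    exclude <- c("^apple", "^dog")

    # Synthetic input, assumed for illustration: the example repeated 20000 times.
    big <- rep(value, times = 20000)

    # Loop: filter once per pattern, as in benbolker() above.
    by_loop <- big
    for (p in exclude) {
        by_loop <- grep(p, by_loop, invert = TRUE, value = TRUE)
    }

    # Base R counterpart of the single-regex idea in jofrhwld().
    combined <- paste(exclude, collapse = "|")
    by_alternation <- big[!grepl(combined, big)]

    # Both approaches should keep exactly the same strings, in the same order.
    stopifnot(identical(by_loop, by_alternation))
    print(length(by_loop))
    ```

    Either expression can then be passed to microbenchmark() in place of the calls above to repeat the comparison at this size.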