
Efficiently filter strings based on a set of substrings


My general question is how to filter an arbitrary set of strings against an arbitrary set of possible substrings.

My particular problem is that I have long lists of filenames that contain UUIDs, and I have a list of other UUIDs that I want to filter out from those. My datasets can vary in terms of length and what proportion of them end up hitting a match, from n = 10 to 1,000,000, and hits from 0% to 100%. I don't think the match rate is a big factor in the performance of my approach, but I know that size does matter for it.

My naive approach is below. Essentially, from a vector of UUIDs, I concatenate them all together with a | to make a single long regex and test every member of the original data against it. This is very fast for small data, but rapidly gets out of hand.

Setup:

# create a master population to choose from
uuid_list <- uuid::UUIDgenerate(n = 1e5)

filter_from_string <- function(n) {
  
  # choose a representative population
  uuid_population <- sample(uuid_list, n)
  
  # generate the longer strings
  data_strings <- 
    paste0(
      "Here_are_some_strings_with_associated_UUIDs_",
      uuid_population,
      "_and_additional_metadata.json"
    )
  
  # generate a small sample to check against
  uuid_sample <- sample(uuid_population, n/10)
  
  # check against a single long regex that looks like:
  #   "uuid1|uuid2|uuid3|..."
  filter_index <- 
    stringr::str_detect(
      data_strings,
      paste(uuid_sample, collapse = "|")
    )
  
  data_strings[!filter_index]
}
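
A quick sanity check (the expected count follows from the setup: with n = 100, sample(uuid_population, n/10) picks 10 of the 100 UUIDs for removal, so 90 strings should remain):

# each sampled UUID appears in exactly one string, so filtering
# removes n/10 of the n strings
length(filter_from_string(100))  # expected: 90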

Test:

x <- 
  microbenchmark::microbenchmark(
    "n=10" = filter_from_string(10),
    "n=100" = filter_from_string(100),
    "n=1000" = filter_from_string(1000),
    "n=10000" = filter_from_string(10000),
    times = 10
  )
#> Warning in microbenchmark::microbenchmark(`n=10` = filter_from_string(10), :
#> less accurate nanosecond times to avoid potential integer overflows

# milliseconds
summary(x, "ms")[c("expr", "mean", "median")]
#>      expr         mean       median
#> 1    n=10    0.4201393    0.0713195
#> 2   n=100    1.4376527    0.4230585
#> 3  n=1000   25.4073679   25.8098075
#> 4 n=10000 2340.1810916 2313.1806605

# relative time
summary(x, "relative")[c("expr", "mean", "median")]
#>      expr        mean       median
#> 1    n=10    1.000000     1.000000
#> 2   n=100    3.421848     5.931877
#> 3  n=1000   60.473676   361.889911
#> 4 n=10000 5570.012354 32434.056051

Created on 2023-05-03 with reprex v2.0.2

As you can see, the runtime grows by much more than 10x at each step: both the number of strings and the number of alternatives in the combined regex scale with n, so the total work grows roughly quadratically.

I have a fix that speeds this up for my particular use case: since the UUIDs follow a known pattern, I can extract them from each string and then do an exact equality match against the sample. I'll post this in an answer below, but I'm curious what to do when there's no obvious pattern to exploit, just an arbitrary population of strings and possible substrings.
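
For the general case, one idea I can sketch (not benchmarked here, and filter_fixed is just a placeholder name) is to skip building one huge alternation and instead test each substring as a fixed (non-regex) pattern, OR-ing the per-substring results together:

# sketch: match each substring literally and combine the hits with a
# logical OR, avoiding a single giant alternation regex
filter_fixed <- function(data_strings, substrings) {
  hit <- 
    Reduce(
      `|`,
      lapply(substrings, function(s) {
        stringr::str_detect(data_strings, stringr::fixed(s))
      }),
      init = rep(FALSE, length(data_strings))
    )
  data_strings[!hit]
}

This is still one pass per substring, so the work grows with length(data_strings) * length(substrings), but fixed-string matching sidesteps the regex engine's alternation overhead.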


Solution

  • For this specific use case, where the target substring follows a well-behaved pattern, you can extract it from the longer names and then do exact equality matching against the sample.

    Setup:

    # create a master population to choose from
    uuid_list <- uuid::UUIDgenerate(n = 1e5)
    
    filter_from_string_extract <- function(n) {
      
      # choose a representative population
      uuid_population <- sample(uuid_list, n)
      
      # generate the longer strings
      data_strings <- paste0(
        "Here_are_some_strings_with_associated_UUIDs_",
        uuid_population,
        "_and_additional_metadata.json"
      )
      
      # pre-extract just the UUID into a new vector
      data_uuids <- 
        stringr::str_extract(
          data_strings,
          "[a-f0-9]{8}-[a-f0-9]{4}-[a-f0-9]{4}-[a-f0-9]{4}-[a-f0-9]{12}"
        )
      
      # generate a small sample to check against
      uuid_sample <- sample(uuid_population, n/10)
      
      # filter on exact match of the UUID instead of regex pattern
      data_strings[!(data_uuids %in% uuid_sample)]
    }
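
    The pre-extraction is the key step: the UUID regex is applied once per string, and the filtering itself becomes a plain vector lookup with %in%. For illustration, on a single made-up filename (the UUID here is an arbitrary example, not one generated above):

    # the extraction regex isolates just the UUID portion of a filename
    stringr::str_extract(
      "Here_are_some_strings_with_associated_UUIDs_123e4567-e89b-12d3-a456-426614174000_and_additional_metadata.json",
      "[a-f0-9]{8}-[a-f0-9]{4}-[a-f0-9]{4}-[a-f0-9]{4}-[a-f0-9]{12}"
    )
    # expected: "123e4567-e89b-12d3-a456-426614174000"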
    

    Test:

    y <- 
      microbenchmark::microbenchmark(
        "n=10" = filter_from_string_extract(10),
        "n=100" = filter_from_string_extract(100),
        "n=1000" = filter_from_string_extract(1000),
        "n=10000" = filter_from_string_extract(10000),
        "n=100000" = filter_from_string_extract(100000),
        times = 10
      )
    #> Warning in microbenchmark::microbenchmark(`n=10` =
    #> filter_from_string_extract(10), : less accurate nanosecond times to avoid
    #> potential integer overflows
    
    # milliseconds
    summary(y, "ms")[c("expr", "mean", "median")]
    #>       expr        mean      median
    #> 1     n=10   0.0994824   0.0759730
    #> 2    n=100   0.2499114   0.2374515
    #> 3   n=1000   1.8917400   1.8155005
    #> 4  n=10000  18.4703319  17.4055865
    #> 5 n=100000 204.7527167 199.9893490
    
    # relative time
    summary(y, "relative")[c("expr", "mean", "median")]
    #>       expr        mean      median
    #> 1     n=10    1.000000    1.000000
    #> 2    n=100    2.512117    3.125472
    #> 3   n=1000   19.015826   23.896654
    #> 4  n=10000  185.664318  229.102267
    #> 5 n=100000 2058.180308 2632.373988
    

    Created on 2023-05-03 with reprex v2.0.2

    Even going up another order of magnitude, there's acceptable performance, and it appears to scale linearly with n.