Search code examples
rubystring-matchingdfa

Ruby Scalable String Match


I want to match a dictionary of strings to a dictionary of patterns very quickly.

I have 10000 inputs of length 255, and 10000 patterns of length 20. I need to match these strings very quickly and I need to know which patterns matched which inputs.

My patterns need to support regular expression matching. Right now, I have everything in a mysql database and I'm using a regex call in mysql to do what I need. Ex:

patterns.each do |p|
  select * from inputs where inputs.value regex #{p}
end

but this will get slower and slower the more patterns I have. I also considered:

patterns.each do |p|
  inputs.grep(p)
end

but it has the same issue when I have too many patterns. I want to create a DFA in ruby and store it in memory (separate worker process) until it's needed, but I don't if there's any resources available for doing something like this.


Solution

  • If you don't need the matches, just union the regexps; then the speed is in C. Proof of concept:

    # prepwork:
    # make 676 two-letter patterns plus a digit
    patterns = ('a'..'z').flat_map { |x| ('a'..'z').map { |y| x + y + "\\d" } }
    # convert to regexp
    regexp = Regexp.union(patterns.map{ |pattern| /(#{pattern})/ })
    
    # matching:
    # match
    match = regexp.match("mu5")
    # find out which pattern was hit
    if match
      puts patterns[match.captures.find_index { |x| x }]
      # => mu\d
    end
    

    If you need to see which pattern was hit, you either can't use any capturing brackets, or you need to rewrite my example to use named captures instead of plain ones, which shouldn't be too hard.

    EDIT: If you don't need to see which pattern was hit, you can probably speed it up by removing the parentheses from the regexp, replacing /(#{pattern})/ with /#{pattern}/, so that matching does not need to create a huge array each time.