I want to match a dictionary of strings to a dictionary of patterns very quickly.
I have 10000 inputs of length 255, and 10000 patterns of length 20. I need to match these strings very quickly and I need to know which patterns matched which inputs.
My patterns need to support regular expression matching. Right now, I have everything in a mysql database and I'm using a regex call in mysql to do what I need. Ex:
patterns.each do |p|
select * from inputs where inputs.value regex #{p}
end
but this will get slower and slower the more patterns I have. I also considered:
patterns.each do |p|
inputs.grep(p)
end
but it has the same issue when I have too many patterns. I want to create a DFA in ruby and store it in memory (separate worker process) until it's needed, but I don't if there's any resources available for doing something like this.
If you don't need the matches, just union the regexps; then the speed is in C. Proof of concept:
# prepwork:
# make 676 two-letter patterns plus a digit
patterns = ('a'..'z').flat_map { |x| ('a'..'z').map { |y| x + y + "\\d" } }
# convert to regexp
regexp = Regexp.union(patterns.map{ |pattern| /(#{pattern})/ })
# matching:
# match
match = regexp.match("mu5")
# find out which pattern was hit
if match
puts patterns[match.captures.find_index { |x| x }]
# => mu\d
end
If you need to see which pattern was hit, you either can't use any capturing brackets, or you need to rewrite my example to use named captures instead of plain ones, which shouldn't be too hard.
EDIT: If you don't need to see which pattern was hit, you can probably speed it up by removing the parentheses from the regexp, replacing /(#{pattern})/
with /#{pattern}/
, so that matching does not need to create a huge array each time.