I have a function that takes an input string and then runs the string through several regular expressions until it finds a match. Once a match is found, I return an output that is a function of the original string and the match. So in ruby:
str = "my very long original string ... millions of characters"
case str
when regex1
do_something1(str,$1)
when regex2
do_something2(str,$1)
when regex3
do_something3(str,$1)
...
when regex100000
do_something100000(str,$1)
else
do_something_else(str)
end
Now, is Ruby actually optimizing this switch loop so that str only gets traversed once? Assuming it isn't, then this functionality could be performed much more efficiently with one big long regular expression that had embedded callbacks. Something like this:
/(?<callback:do_something1>regex1)|
(?<callback:do_something2>regex2)|
(?<callback:do_something3>regex3)|
...
(?<callback:do_something100000>regex100000)/
Is there any technology that does this?
If Ruby 1.9, and if you use named groups in your regular expressions, then you can combine all of the regular expressions into one with a bit of trickery. Here's the class that does the heavy lifting:
class BigPatternMatcher
def initialize(patterns_and_functions, no_match_function)
@regex = make_big_regex(patterns_and_functions)
@no_match_function = no_match_function
end
def match(s, context)
match = @regex.match(s)
context.send(function_name(match), match)
end
private
FUNC_GROUP_PREFIX = "func_"
FUNC_GROUP_REGEX = /^#{FUNC_GROUP_PREFIX}(.*)$/
def function_name(match)
if match
match.names.grep(FUNC_GROUP_REGEX).find do |name|
match[name]
end[FUNC_GROUP_REGEX, 1]
else
@no_match_function
end
end
def make_big_regex(patterns_and_functions)
patterns = patterns_and_functions.map do |pattern, function|
/(?<#{FUNC_GROUP_PREFIX}#{function}>#{pattern.source})/
end
Regexp.union(patterns)
end
end
We'll come back to how this works. To use it, you'll need a list of regular expressions and the name of the function that should be invoked for each one. Make sure to use named groups only:
PATTERNS_AND_FUNCTIONS = [
[/ABC(?<value>\d+)/, :foo],
[/DEF(?<value>\d+)/, :bar],
]
And the functions, including one that gets called when there's no match:
def foo(match)
p ["foo", match[:value]]
end
def bar(match)
p ["bar", match[:value]]
end
def default(match)
p ["default"]
end
And, finally, here's how it's used. BigPatternMatcher#match
takes the string to match, and the object on which the function should be called:
matcher = BigPatternMatcher.new(PATTERNS_AND_FUNCTIONS, :default)
matcher.match('blah ABC1 blah', self) # => ["foo", "1"
matcher.match('blah DEF2 blah', self) # => ["bar", "2"]
matcher.match('blah blah', self) # => ["default"]
See below the fold for the trickery that makes it work.
BigPatternMatcher#make_big_regex
combines all of the regular expressions into one, each surrounded in parentheses and separated by |
, e.g.
/(ABC(?<value>\\d+))|(DEF(?<value>\\d+))/
That isn't enough, though. We need some way, when one of the sub-expression matches, to identify which one matched, and therefore which function to call. To do that, we'll make each sub-expression its own named group, with the name based on the function that should be called:
/(?<func_foo>ABC(?<value>\\d+))|(?<func_bar>DEF(?<value>\\d+))/
Now, let's see how we get from a match to calling a function. Given the string:
"blah ABC1 blah"
Then the match group is:
#<MatchData "ABC1" func_foo:"ABC1" value:"1" func_bar:nil value:nil>
To figure out which function to call, we just have to find the match with a name that starts with "func_" and which has a non-nil value. The part of the group's name after "func_" names the function to call.
Measuring the performance of this against the straightforward technique in your question is an exercise for the reader.