Search code examples
regexoptimizationperformancenegative-lookbehind

Making Regular Expression more efficient


I'm attempting to determine the end of an English sentence (only approximately), by looking for "!", "?" or ".", but in the case of "." only when not preceeded by common abbreviations such as Mr. or Dr.

Is there any way to make the following Regular Expression even marginally more efficient? Perhaps by sorting the negative lookbehinds by descending size, or even by alphabetical order?

Here is the regular expression I have now:

((?<!St|Sgt|Rev|Ltd|Inc|Lt|Jr|Sr|Esq|Inst|Hon|Gen|Cpl|Comdr|Col|Corp|Mr|Dr|Gov|Mrs|Ms|[A-Z]|Assn|Capt)(\.)|(!)|(\?))(\s*$|\s+([_$#]|[A-Z][^.]))

The Problem:

The site at http://regex.powertoy.org/ reports: "7 matches 21044 probes (finished)" on even a simple paragraph... That outrageous size of the number 21044 seems intimately tied to the number of negative lookbehinds.

I'm seeking to reduce the computational complexity for the RegEx engine, as I have several GB of data to pass through it.

Is there any way to tigten this up? Is negative lookbehind truly the best/only way to accomplish this? Is there a way to do it as a lookahead instead? Is regex the wrong tool for this task?

EDIT: I am able to use either ActionScript's or PHP's RegEx engine.

EDIT: I cannot count on the number of spaces between sentences. Really!? sigh.

Please don't answer if you have no understanding of the inner workings of the RegEx engine, as pertaining to optimization.

Thanks in advance.


Solution

  • Perhaps try only doing the negative-lookbehind test after successfully matching . rather than at every character:

    (?x:  # Allow spacing and comments
        (   
            (\.)    # First match "."
            (?<!    # Then negative-look-behind for titles followed by "."
                (?: St|Sgt|Rev|Ltd|Inc|Lt|Jr|Sr|Esq|Inst|Hon|Gen|Cpl|Comdr|Col|Corp|Mr|Dr|Gov|Mrs|Ms|[A-Z]|Assn|Capt)
                \.
            )
          |  (!)  
          |  (\?)
        )
        ( \s* $  |  \s+ ( [_$#] | [A-Z] [^.] ))
    )
    

    This took the number of probes from 70000 down to 2500 or so on powertoy.org using that site's initial help text. (But powertoy didn't like my multi-line regex or the "x" flag or something so I had to squash the regex onto one line to test it).

    You can go even further by using common prefixes in your list of titles:

    (?x:  # Allow spacing and comments
        (
            (\.)    # First match "."
            (?<!    # Then negative-look-behind for titles followed by "."
                (?:Assn|C(?:apt|ol|omdr|orp|pl)|Dr|Esq|G(?:en|ov)|Hon|I(?:nc|nst)|Jr|L(?:t|td)|M(?:[rs]|rs)|Rev|S(?:gt|[rt])|[A-Z])
                \.
            )
          |  (!)  
          |  (\?)
        )
        ( \s* $  |  \s+ ( [_$#] | [A-Z] [^.] ))
    )
    

    This took the probe count down to about 2000.

    EDIT:
    Another trick that reduces the probe-count is to include a look-ahead for a capital letter at the start of the look-behind section (but I couldn't say for sure that it makes the regex more efficient) (Also including @Swiss's suggestion for word-boundary):

            (?<!   # Then negative-look-behind for titles followed by "."
                   \b (?= [A-Z] )  # But first ensure we have a capital letter before going on
                   (?:Assn|C(?:apt|ol|omdr|orp|pl)|Dr|Esq|G(?:en|ov)|Hon|I(?:nc|nst)|Jr|L(?:t|td)|M(?:[rs]|rs)|Rev|S(?:gt|[rt])|[A-Z])
                \.
            )