regex windows performance perl activestate

Upgraded from Perl 5.8 (32-bit) to 5.16 (64-bit): regex performance hit

I'm running a series of regexes against blocks of data. We recently upgraded from ActiveState Perl 5.8 32-bit (I know... extremely old!) to Perl 5.16 64-bit. The hardware stayed the same (Windows).

We are seeing a performance hit: our parse loop used to take about 2.5 seconds, and now it takes about 5 seconds. Can anybody give me a hint as to what would cause the change? I was expecting an increase in performance, as my understanding was that the engine had improved greatly. Any docs on what I should be doing differently would be greatly appreciated.

Solution

Yes, the regex engine improved greatly after v8. In v10 alone, we saw:

pattern recursion
named captures
possessive quantifiers
backtracking control verbs like (*FAIL) or (*SKIP)
the \K operator
… and some more
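As a rough illustration of the v10 features listed above, here is a minimal sketch. The sample strings and variable names are made up for the example and are not taken from the question.

```perl
use strict;
use warnings;
use 5.010;

# Named captures: results land in %+ instead of $1, $2, ...
my $line = "2013-04-17 ERROR disk full";
my ($year, $level);
if ($line =~ /^(?<year>\d{4})-\d\d-\d\d\s+(?<level>[A-Z]+)/) {
    ($year, $level) = ($+{year}, $+{level});
}

# Possessive quantifier: \d++ never gives digits back once matched,
# so the engine does not retry shorter digit runs on failure.
my $possessive_ok = "12345x" =~ /^\d++x$/ ? 1 : 0;

# \K: everything matched before it is kept out of the replaced text.
(my $path = "/usr/local/lib/perl5") =~ s{^/usr/local/\K[^/]+}{share};

# Pattern recursion: (?1) re-enters capture group 1, which lets the
# pattern match balanced parentheses.
my $balanced = "(a(b)c)" =~ /^(\((?:[^()]++|(?1))*\))$/ ? 1 : 0;

say "$year $level $possessive_ok $path $balanced";
```

None of this works on 5.8, so code that has to run on both installations cannot use it.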

Also, more internals were made Unicode-aware.

In v12, the Unicode support was cleaned up, and the \p and \X operators in regexes were greatly enhanced.

In v14, the Unicode support was bumped to 6.0. Charnames for the \N operator were improved (see also charnames pragma). The new character model can treat any unsigned integer as a codepoint. In the regex engine,

regexes can now carry charclass modifiers like /u, /d, /l, /a, /aa.
Non-destructive substitution with /r was implemented.
The RE engine is now reentrant, so embedded code can use regexes.
\p was cleaned up
regex compilation is faster when a switch to Unicode semantics is necessary.
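Two of the v14 additions above, sketched with made-up data: /r returns a modified copy instead of changing the original, and /a restricts \d (and \s, \w) to ASCII, which may matter if your data only ever contains ASCII digits.

```perl
use strict;
use warnings;
use 5.014;

my $raw = "  some   block  of data ";

# /r returns the modified copy and leaves $raw untouched;
# the two substitutions are chained left to right.
my $clean = $raw =~ s/^\s+|\s+$//gr =~ s/\s+/ /gr;

# ARABIC-INDIC DIGIT THREE is a digit under Unicode rules (/u),
# but not under ASCII-restricted rules (/a).
my $digit = "\x{0663}";
my $unicode_match = $digit =~ /^\d$/u ? 1 : 0;
my $ascii_match   = $digit =~ /^\d$/a ? 1 : 0;

print "clean=[$clean] unicode=$unicode_match ascii=$ascii_match\n";
```

Whether /a changes the timing of a given parse loop is something to measure, not assume.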

In v16, perl almost supports Unicode 6.1. In the regex engine,

efficiency of \p charclasses was increased.
Various regex bugs (often involving case-insensitive matching) were fixed.

Not all of these features come at a price, but Unicode-awareness in particular makes the internals more complicated, and slower.

You also cannot just wave a hand and attribute the doubled execution time to the move from perl5 v8 x86 to perl5 v16 x64; there are too many variables:

were both Perls compiled with the same flags?
- are both perls threaded perls? (disabling threading support makes it faster)
- how big are your integers? 64 bit or 32 bit?
- what compiler optimizations were chosen?
did your previous Perl have some distribution-specific patches applied?

Basically, you have to compare the whole perl -V output.
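If the full perl -V dump is too noisy, a small script can print just the build settings mentioned above so the two installations can be diffed. This is a sketch: the choice of keys is mine, and it avoids newer syntax so that it should also run on the old 5.8 install.

```perl
use strict;
use warnings;
use Config;

# Build settings that are likely to differ between the two installations.
my @keys = qw(version archname usethreads useithreads usemultiplicity
              ivsize ptrsize use64bitint cc optimize ccflags);

# Unset options show up as undef in %Config; print them as 'undef'.
my %build = map { $_ => (defined $Config{$_} ? $Config{$_} : 'undef') } @keys;

printf "%-16s %s\n", $_, $build{$_} for @keys;
```

Run it under both perls and compare the two outputs line by line. Distribution-specific patches are not covered here; for those, look at the "Locally applied patches" section of perl -V.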

If you are hitting a performance ceiling with regexes, they may be the wrong tool for extensive parsing. At the very least, you may use the newer features to optimize the regexes to eliminate some backtracking.
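One way to check whether a newer feature actually helps is to benchmark the old and new pattern on input that nearly matches, since that is where backtracking costs the most. The sketch below uses a hypothetical record format ("name = digits ;") and the core Benchmark module; the patterns and data are assumptions for illustration, not your real ones.

```perl
use strict;
use warnings;
use Benchmark qw(cmpthese);

# Hypothetical record format: "name = digits ;"
my $plain      = qr/^(\w+)\s*=\s*(\d+)\s*;/;
my $possessive = qr/^(\w++)\s*+=\s*+(\d++)\s*+;/;

my $good = "timeout = 2500 ;";
# Nearly valid input: a stray 'x' before the ';' makes the match fail late.
my $bad  = "timeout = " . ("9" x 200) . (" " x 200) . "x;";

# Both patterns capture the same fields on valid input ...
my @a = $good =~ $plain;
my @b = $good =~ $possessive;

# ... and both reject the invalid input.
my $plain_rejects      = $bad !~ $plain      ? 1 : 0;
my $possessive_rejects = $bad !~ $possessive ? 1 : 0;

print "captures: @a | @b\n";

# Time only the failing case; the numbers depend on the perl build and machine.
cmpthese(50_000, {
    plain      => sub { my $m = $bad =~ $plain },
    possessive => sub { my $m = $bad =~ $possessive },
});
```

Possessive quantifiers only preserve behaviour when the pattern never needs to give characters back, so verify equivalence on real data before swapping patterns.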

If your parsing code describes a (roughly) context-free language (i.e. you don't use (?{...}), (?=...) or related regex features), and parsing means doing something like generating a tree, then Marpa::R2 might speed things up considerably.