
Expected lifespan of ereg, migrating to preg


I work on a large PHP application (>1 million lines, 10 yrs old) which makes extensive use of ereg and ereg_replace - currently 1,768 unique regular expressions in 516 classes.

I'm very aware why ereg is being deprecated but clearly migrating to preg could be highly involved.

Does anyone know how long ereg support is likely to be maintained in PHP, and/or have any advice for migrating to preg on this scale? I suspect automated translation from ereg to preg is impossible or impractical.


Solution

  • I'm not sure when ereg will be removed, but my bet is as of PHP 6.0. (As it turned out, PHP 6 was never released; the extension was removed in PHP 7.0.0.)

    As for your second question, translating ereg to preg doesn't seem that hard. If your application has more than 1 million lines, surely you have the resources to put someone on this job for a week at most. I would grep all the ereg_ instances in your code and set up some macros in your favorite IDE (simple stuff like adding delimiters, modifiers and so on).
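
    If grep is not at hand, the inventory step can be sketched in PHP itself. This is only a rough sketch: find_posix_regex_calls is a hypothetical helper name, it works line by line on source text rather than parsing PHP, and it will also report calls that appear inside comments or strings.

    ```php
    <?php
    // Hypothetical helper: list POSIX regex calls in a chunk of PHP source.
    // Returns an array of array(line number, function name) pairs.
    function find_posix_regex_calls($source)
    {
        $hits = array();
        foreach (explode("\n", $source) as $index => $line) {
            // Match ereg, eregi, ereg_replace, eregi_replace, split and spliti
            // used as plain function calls. The lookbehind skips longer names
            // (preg_split, mb_ereg), method calls and variable functions.
            $calls = '~(?<![\w>:$])(eregi?(?:_replace)?|spliti?)\s*\(~';
            if (preg_match_all($calls, $line, $m)) {
                foreach ($m[1] as $name) {
                    $hits[] = array($index + 1, $name);
                }
            }
        }
        return $hits;
    }

    // Made-up sample source, for illustration only.
    $sample = implode("\n", array(
        'if (ereg("^a", $s)) {',
        '    $parts = preg_split("/,/", $s);',
        '    $clean = eregi_replace("a", "b", $s);',
        '}',
    ));
    foreach (find_posix_regex_calls($sample) as $hit) {
        echo "line {$hit[0]}: {$hit[1]}\n";
    }
    // Prints:
    // line 1: ereg
    // line 3: eregi_replace
    ```

    Running something like this over each file gives a list of call sites to work through, with the preg_split call correctly left out.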

    I bet most of the 1768 regexes can be ported using a macro, and the others, well, a good pair of eyes.
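
    The mechanical part of such a macro (adding delimiters and a modifier) might look like the sketch below. ereg_pattern_to_preg is a hypothetical name, and the sketch assumes the pattern does not already contain a backslash-escaped '~'. It does not address semantic differences between the two engines, which is where the pair of eyes comes in.

    ```php
    <?php
    // Hypothetical helper: wrap a POSIX pattern so the preg_* functions accept it.
    function ereg_pattern_to_preg($pattern, $caseInsensitive = false)
    {
        // Use '~' as the delimiter and escape any literal '~' inside the pattern.
        $quoted = '~' . str_replace('~', '\~', $pattern) . '~';

        // eregi-style calls become the same pattern with the "i" modifier.
        return $caseInsensitive ? $quoted . 'i' : $quoted;
    }

    $posix = '^[a-z]+@[a-z]+\.com$';
    $pcre  = ereg_pattern_to_preg($posix, true);

    echo $pcre, "\n";                               // ~^[a-z]+@[a-z]+\.com$~i
    var_dump(preg_match($pcre, 'Bob@Example.com')); // int(1)
    ```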

    Another option might be to write wrappers around the ereg functions if they are not available, implementing the changes as needed:

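    // Fallback shims for PHP builds without the POSIX regex extension.
    if (function_exists('ereg') !== true)
    {
        function ereg($pattern, $string, &$regs = null)
        {
            // '~' is the PCRE delimiter, so escape any literal '~' in the pattern.
            $found = preg_match('~' . addcslashes($pattern, '~') . '~', $string, $regs);

            // ereg() returned the match length (at least 1), or false on no match.
            return $found ? max(1, strlen($regs[0])) : false;
        }
    }
    
    if (function_exists('eregi') !== true)
    {
        function eregi($pattern, $string, &$regs = null)
        {
            // Same as above, with the "i" modifier for case-insensitive matching.
            $found = preg_match('~' . addcslashes($pattern, '~') . '~i', $string, $regs);

            return $found ? max(1, strlen($regs[0])) : false;
        }
    }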
    

    You get the idea. The PEAR package PHP_Compat might be a viable solution too.
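
    Following the same pattern, a wrapper for ereg_replace could be sketched as below. This is an assumption-laden sketch, not a drop-in replacement: it needs closures (PHP 5.3+), it expands only the \0 to \9 references that ereg_replace understood, and it leaves any '$' in the replacement untouched, since preg_replace would otherwise read '$1' as a backreference.

    ```php
    <?php
    if (function_exists('ereg_replace') !== true)
    {
        function ereg_replace($pattern, $replacement, $string)
        {
            $pcre = '~' . addcslashes($pattern, '~') . '~';

            return preg_replace_callback($pcre, function ($m) use ($replacement) {
                // Expand \0..\9 by hand; groups that did not match become ''.
                return preg_replace_callback('~\\\\([0-9])~', function ($d) use ($m) {
                    $group = (int) $d[1];
                    return isset($m[$group]) ? $m[$group] : '';
                }, $replacement);
            }, $string);
        }
    }

    echo ereg_replace('(a)(b)', '\\2\\1', 'xaby'), "\n"; // xbay
    echo ereg_replace('a', '$1', 'a'), "\n";             // $1
    ```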


    Differences from POSIX regex

    As of PHP 5.3.0, the POSIX Regex extension is deprecated. There are a number of differences between POSIX regex and PCRE regex. This page lists the most notable ones that are necessary to know when converting to PCRE.

    1. The PCRE functions require that the pattern is enclosed by delimiters.
    2. Unlike POSIX, the PCRE extension does not have dedicated functions for case-insensitive matching. Instead, this is supported using the /i pattern modifier. Other pattern modifiers are also available for changing the matching strategy.
    3. The POSIX functions find the longest of the leftmost match, but PCRE stops on the first valid match. If the string doesn't match at all it makes no difference, but if it matches it may have dramatic effects on both the resulting match and the matching speed. To illustrate this difference, consider the following example from "Mastering Regular Expressions" by Jeffrey Friedl. Using the pattern one(self)?(selfsufficient)? on the string oneselfsufficient with PCRE will result in matching oneself, but using POSIX the result will be the full string oneselfsufficient. Both (sub)strings match the original string, but POSIX requires that the longest be the result.
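
    The third difference is easy to reproduce with the PCRE functions alone. The snippet below uses the pattern and subject from the example above; the reordered pattern is only one possible workaround for this particular case, not a general conversion rule.

    ```php
    <?php
    $subject = 'oneselfsufficient';

    // PCRE takes the first successful path: (self)? matches "self", then
    // (selfsufficient)? matches nothing, so the overall match is "oneself".
    preg_match('/one(self)?(selfsufficient)?/', $subject, $first);
    echo $first[0], "\n";   // oneself

    // Listing the longer alternative first recovers the POSIX result here.
    preg_match('/one(selfsufficient|self)?/', $subject, $longest);
    echo $longest[0], "\n"; // oneselfsufficient
    ```

    Patterns with optional groups or alternations like this are the ones worth reviewing by hand after a mechanical conversion.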