Tags: php, serialization, preg-replace-callback

rework preg_replace with preg_replace_callback


I've seen many answers about this, but my case is a bit specific, so I still need some help. I'm trying to update Blogstudio's Fix Serialization script, which uses preg_replace() with the /e modifier.

The code in question is this:

$data = preg_replace('!s:(\d+):([\\\\]?"[\\\\]?"|[\\\\]?"((.*?)[^\\\\])[\\\\]?");!e', "'s:'.strlen(unescape_mysql('$3')).':\"'.unescape_quotes('$3').'\";'", $data);

The confusion for me lies in:

  1. Are those functions intended to deal with the quotes that get escaped because of the /e modifier, or not?
  2. What should the result be when there is no $3?

I rewrote it as follows, but I still run into warnings and other problems, so the result is not what was intended:

$data = preg_replace_callback(
    '!s:(\d+):([\\\\]?"[\\\\]?"|[\\\\]?"((.*?)[^\\\\])[\\\\]?");!',
    function($d) {
        $length = strlen(unescape_mysql($d[3]));
        $value = unescape_quotes($d[3]);
        $result = 's:' . $length . ':\"' . $value . '\";';
        return $result;
    },
    $data
);

Solution

  • The problem:

    s:(\d+): # group 1
    (        # group 2
        [\\\\]?"[\\\\]?"
      |
        [\\\\]?"
        ((.*?)[^\\\\]) # group 3 (and 4)
        [\\\\]?"
    )
    ;
    

    As you can see, group 2 contains an alternation with two branches. Groups 3 and 4 sit in the second branch, so when the first branch succeeds these groups are not defined.
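
A quick way to see this, as a minimal sketch: run the question's pattern (without the /e modifier) through preg_match on two made-up sample strings, one empty serialized string and one non-empty. The sample strings and variable names are only for illustration.

```php
<?php
// The pattern from the question, minus the /e modifier.
$pattern = '!s:(\d+):([\\\\]?"[\\\\]?"|[\\\\]?"((.*?)[^\\\\])[\\\\]?");!';

preg_match($pattern, 's:0:"";', $empty);      // the first branch matches
preg_match($pattern, 's:5:"hello";', $full);  // the second branch matches

var_dump(count($empty));    // int(3): only indexes 0, 1 and 2 exist
var_dump(isset($empty[3])); // bool(false): group 3 took no part in the match
var_dump($full[3]);         // string(5) "hello"
```

PHP drops trailing groups that did not take part in the match, which is why reading index 3 in the callback raises a warning for empty strings.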

    Let's clean up the pattern by removing the useless capture groups:

    s:\d+:
    (?:
        [\\\\]? " [\\\\]? "
      |
        [\\\\]? "
        (.*? [^\\\\])      # group 1
        [\\\\]? "
    )
    ;
    

    Now the target group is group 1, but the branch problem remains. There are two possible ways to solve it:

    • you can test if the index exists with isset in the callback function.
    • you can change the pattern so that group 1 is defined in both branches, using the branch reset feature.

    First way:

    $data = preg_replace_callback(
       '~s:\K\d+:(?:[\\\\]?"[\\\\]?"|[\\\\]?"(.*?[^\\\\])[\\\\]?");~', 
       function ($m) {
         return (isset($m[1]))
           ? strlen(unescape_mysql($m[1])) . ':\"' . $m[1] . '\";'
           : '0:\"\";';
       },
       $data
    );
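
To try the first way in isolation, here is a rough sketch on a made-up two-entry sample. The unescape_mysql() below is only a hypothetical stand-in (the real helper from the script is not shown in this thread), and the sample data is an assumption about what an escaped dump line looks like. Note that \K makes the match start after "s:", so the callback only returns the part that follows it.

```php
<?php
// Hypothetical stand-in for the script's unescape_mysql(); it only handles
// escaped backslashes and quotes, which is enough for this demo.
function unescape_mysql(string $value): string
{
    return strtr($value, ['\\\\' => '\\', '\\"' => '"', "\\'" => "'"]);
}

// Made-up sample: both declared lengths (1 and 9) are wrong on purpose.
$data = <<<'SQL'
s:1:\"it\'s\";s:9:\"\";
SQL;

$fixed = preg_replace_callback(
    '~s:\K\d+:(?:[\\\\]?"[\\\\]?"|[\\\\]?"(.*?[^\\\\])[\\\\]?");~',
    function ($m) {
        // $m[1] is missing when the empty-string branch matched.
        return isset($m[1])
            ? strlen(unescape_mysql($m[1])) . ':\"' . $m[1] . '\";'
            : '0:\"\";';
    },
    $data
);

echo $fixed, "\n"; // s:4:\"it\'s\";s:0:\"\";
```

The length is computed on the unescaped value (it's, 4 bytes), while the escaped text itself is written back unchanged.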
    

    Second way (with the branch reset feature):

    $data = preg_replace_callback(
       '~s:\K\d+:(?|[\\\\]?"[\\\\]?"()|[\\\\]?"(.*?[^\\\\])[\\\\]?");~', 
       function ($m) {
         return strlen(unescape_mysql($m[1])) . ':\"' . $m[1] . '\";';
       },
       $data
    );
    

    In a branch reset group, capture groups have the same numbers in each branch. To solve your problem, you only need to create an empty capture group in the first branch:

    (?|  # open a branch reset group
         foo
         ()  # capture group 1
      |
         bar
         (baz) # capture group 1 (too)
    )
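
The same schematic pattern can be checked directly with preg_match. This is just a small sketch with placeholder subjects ('foo' and 'barbaz'):

```php
<?php
// Branch reset group: both branches number their capture group as 1.
$pattern = '~(?|foo()|bar(baz))~';

preg_match($pattern, 'foo', $first);     // first branch, empty group 1
preg_match($pattern, 'barbaz', $second); // second branch, group 1 = "baz"

var_dump($first[1]);  // string(0) ""
var_dump($second[1]); // string(3) "baz"
```

Since the empty group takes part in the match, index 1 always exists and the callback no longer needs an isset test.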