Tags: php, json, regex, preg-match-all, pcre2

Regex | PHP capture each illegal double quote in a json string


Given the following json string: {"key":"val"ue","other":"invalid ""quo"te"}

I want to capture each illegal double quote inside the values. In the example there is one double quote in the value of the key property and there are three double quotes in the property called other.

I've seen multiple comments noting that this is invalid JSON (correct) and that the JSON should be fixed at the source before I receive it. However, this is not possible in my case.

Assuming that this only occurs in the values and not in the keys, I think it's safe to assume that a starting sequence is a colon followed by a double quote. An ending sequence is a double quote followed by a comma OR a closing curly brace.

I've tried the following regex (among many other versions), which is the closest to my desired solution:

/:\s?".*?(").*?[,}]/i

This correctly captures the one double quote in the key property, but only captures the first double quote in the 'other' property. I would like it to capture the other two double quotes as well as a separate capture.

Another regex I've tried is /:\s?".*?("{1,})[^,}].*?[,}]/i. It does the same as the first regex, but captures the two adjacent double quotes in a single capture (not preferable).

My goal ultimately is to capture each double quote separately, so four captures. What I think I need in order to accomplish this is a way to make the capture group 'greedy?' so that it doesn't stop at the first double quote.

How could I achieve this?

I am using the following PHP code to test the Regex:

$text = '{"key":"val"ue","other":"invalid ""quo"te"}';
$pattern = '/:\s?".*?(").*?[,}]/i';
preg_match_all($pattern, $text, $matches, PREG_OFFSET_CAPTURE);
echo '<pre>' . print_r($matches, true) . '</pre>';

Solution

  • What you could do is to use a variant of The Trick...

    The trick is that we match what we don't want on the left side of the alternation (the |), then we capture what we do want on the right side.

    The good thing about PCRE is that there are verbs available to just skip the left side.

    (?:(?:"\s*[:,]|\{)\s*"|\\"|"\s*[:}])(*SKIP)(*F)|"
    

    See this demo at regex101

    Everything to the left of (*SKIP)(*F) matches the "correct" quotes (regex101), which are then skipped. Any remaining quotes are matched individually by the right side of the alternation, |".

    Finally, you can use the PREG_OFFSET_CAPTURE flag to get the position (byte offset) of each "illegal quote".
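
To make the above concrete, here is a minimal sketch that applies the pattern from the answer to the string from the question and collects the offsets via PREG_OFFSET_CAPTURE. The constant ILLEGAL_QUOTE_PATTERN and the helper findIllegalQuoteOffsets() are hypothetical names introduced only for this illustration; they are not part of the original answer. One detail to watch: inside a single-quoted PHP string, the \\" alternative of the regex has to be written with four backslashes, otherwise PCRE receives \" (a plain quote) and every quote ends up being skipped.

```php
<?php
// Pattern from the answer. In a single-quoted PHP string, four backslashes
// are needed so that PCRE receives \\" (a literal backslash, then a quote).
const ILLEGAL_QUOTE_PATTERN = '/(?:(?:"\s*[:,]|\{)\s*"|\\\\"|"\s*[:}])(*SKIP)(*F)|"/';

/**
 * Hypothetical helper (not from the original answer): returns the byte
 * offset of every double quote that the pattern does not classify as a
 * structural or already-escaped quote.
 */
function findIllegalQuoteOffsets(string $json): array
{
    $count = preg_match_all(ILLEGAL_QUOTE_PATTERN, $json, $matches, PREG_OFFSET_CAPTURE);
    if ($count === false) {
        return []; // the regex engine reported an error
    }
    // With PREG_OFFSET_CAPTURE each match is [matchedText, byteOffset].
    return array_column($matches[0], 1);
}

$text = '{"key":"val"ue","other":"invalid ""quo"te"}';
$offsets = findIllegalQuoteOffsets($text);
echo implode(', ', $offsets), PHP_EOL; // prints: 11, 33, 34, 38
```

The four offsets correspond to the one stray quote in the first value and the three in the second. Note that the pattern relies on the assumption stated in the question: a value never contains a quote directly followed by a colon, comma, or closing brace, since such a quote would be treated as structural and skipped.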