regexjson

Regex to validate JSON


I am looking for a Regex that allows me to validate json.

I am very new to Regex's and i know enough that parsing with Regex is bad but can it be used to validate?


Solution

  • Yes, a complete regex validation is possible.

    Some modern regex implementations allow for recursive regular expressions, which can verify a complete JSON serialized structure. The json.org specification makes it quite straightforward.

    $pcre_regex = '/
        (?(DEFINE)
            (?<ws>      [\t\n\r ]* )
            (?<number>  -? (?: 0|[1-9]\d*) (?: \.\d+)? (?: [Ee] [+-]? \d++)? )    
            (?<boolean> true | false | null )
            (?<string>  " (?: [^\\\\"\x00-\x1f] | \\\\ ["\\\\bfnrt\/] | \\\\ u [0-9A-Fa-f]{4} )* " )
            (?<pair>    (?&ws) (?&string) (?&ws) : (?&value) )
            (?<array>   \[ (?: (?&value) (?: , (?&value) )* )? (?&ws) \] )
            (?<object>  \{ (?: (?&pair) (?: , (?&pair) )* )? (?&ws) \} )
            (?<value>   (?&ws) (?: (?&number) | (?&boolean) | (?&string) | (?&array) | (?&object) ) (?&ws) )
        )
        \A (?&value) \Z
        /sx';
    

    The example above uses the Perl 5.10/PCRE2 subroutine call syntax to simplify the expression and improve readability. It works quite well in PHP with the PCRE functions. Should work almost unmodified in Perl (provided one replaces 4-backslash sequences '\\\\' with 2-backslash sequences '\\' in the <string> subroutine); and can be adapted for other languages (e.g. Ruby, or those for which PCRE bindings are available).

    This regex passes all tests from the JSON.org test suite (see link at the end of the page) as well as those from Nicolas Seriot's JSON Parser test suite.1

    Simpler RFC4627 verification

    A simpler approach is the minimal consistency check as specified in RFC4627, section 6. It's however just intended as security test and basic non-validity precaution:

    var jsonCode = /* untrusted input */;
    
    var jsonObject = !(/[^,:{}\[\]0-9.\-+Eaeflnr-u \n\r\t]/.test(
        jsonCode.replace(/"(\\.|[^"\\])*"/g, '')))
        && eval('(' + jsonCode + ')');
    

    1 With the exception of two cases whose input is very large, causing the regex to time out. More generally, this approach is bound to fail on inputs large enough to hit the resource limits of the matching engine (either in time or space).