java json regex symbols replaceall

Regular expression for splitting JSON text in lines after symbols


I am trying to use a regular expression to produce this kind of string

{
"key1"
:
"value1"
,
"key2"
:
"value2"
,
"arrayKey"
:
[
{
"keyA"
:
"valueA"
,
"keyB"
:
"valueB"
,
"keyC"
:
[
0
,
1
,
2
]
}
]
}

from

JSONObject.toString()

that is one long line of text in my Android Java app

{"key1":"value1","key2":"value2","arrayKey":[{"keyA":"valueA","keyB":"valueB","keyC":[0,1,2]}]}

I found this regular expression for finding all commas.

/(,)(?=(?:[^"]|"[^"]*")*$)/

Now I need to know:

0- if this is reliable, that is, does what they say.

1- if this also works with commas inside double-quotes.

2- if this takes into account escaped double-quotes.

3- if I have to take into account also single quotes, as this file is produced by my app but occasionally it could be manually edited by the user.

5- It has to be used with the multi-line flag to work with multi-line text.

6- It has to work with replaceAll().

The resulting regular expression will be used for replacing each symbol with a two-char sequence made of the symbol itself plus a \n character.

The resulting text has to be still JSON text.

Subsequent replace actions will take place also for the other symbols

: [ ] { } 

and other symbols that can be found in JSON files outside the alphanumeric sequences between quotes (I do not know if the mentioned symbols are the only ones).


Solution

  • 0- if this is reliable, that is, does what they say.

    Let's break down the expression a little:

    • (,) is a capturing group that matches a single comma
    • (?=...) would mean a positive lookahead meaning the comma would need to be followed by a match of that group's content
    • (?:...)* would be a non-capturing group that can occur 0 to many times
    • [^"]|"[^"]*" would match either any character except a double quote ([^"]) or (|) a pair of double quotes with any character in between except other double quotes ("[^"]*")

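    The trailing $ anchors the lookahead at the end of the input, so a comma matches only if everything after it consists of unquoted characters and complete "..." pairs, i.e. an even number of double quotes follows it. The last part in particular makes the expression unreliable as soon as a text value contains an escaped double quote (\"), so the answer would be "this is reliable if the input is simple enough".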
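    To make that concrete, here is a minimal sketch that lists which commas the expression matches. The class and method names are made up for illustration, and the expression is written without the JavaScript-style slashes because Java does not use them.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class CommaRegexCheck {
    // The expression from the question, minus the surrounding slashes.
    static final Pattern COMMA = Pattern.compile("(,)(?=(?:[^\"]|\"[^\"]*\")*$)");

    // Returns the index of every comma the expression matches.
    static List<Integer> matchedCommas(String text) {
        List<Integer> positions = new ArrayList<>();
        Matcher m = COMMA.matcher(text);
        while (m.find()) {
            positions.add(m.start());
        }
        return positions;
    }

    public static void main(String[] args) {
        // {"a":"x,y","b":1} : only the structural comma (index 10) matches.
        System.out.println(matchedCommas("{\"a\":\"x,y\",\"b\":1}"));
        // {"a":"x,y\"z","b":1} : the escaped quote breaks the quote counting,
        // so the comma inside the string (index 7) matches as well.
        System.out.println(matchedCommas("{\"a\":\"x,y\\\"z\",\"b\":1}"));
    }
}
```

    In the second input the comma at index 7 belongs to the text value, so a replacement there would put a raw line break inside a JSON string.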

    1- if this also works with commas inside double-quotes.

    If the double quote pairs are correctly identified, any commas in between are ignored.

    2- if this takes into account escaped double-quotes.

    Here's one of the major problems: escaped double quotes would need to be handled. This can get quite complex if you want to handle arbitrary cases, especially if the texts could contain commas as well.
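    If you want to stay with regular expressions, one common workaround is to match whole string literals (including backslash escapes) as one alternative and the symbols as another, and to change only the symbols. The following is a sketch under that assumption, not a validated JSON tokenizer; the class name, method name and the symbol set { } [ ] : , are chosen for illustration, and single-quoted strings are not handled.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class EscapeAwareSplit {
    // Alternative 1: a double-quoted string, where a backslash escapes the next character.
    // Alternative 2 (group 1): one structural symbol outside of any string.
    private static final Pattern TOKEN =
            Pattern.compile("\"(?:[^\"\\\\]|\\\\.)*\"|([{}\\[\\]:,])");

    static String splitAfterSymbols(String json) {
        Matcher m = TOKEN.matcher(json);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            // Strings are copied unchanged; symbols get a trailing line break.
            String replacement = m.group(1) != null ? m.group(1) + "\n" : m.group();
            // quoteReplacement keeps '$' and '\' in string values from being interpreted.
            m.appendReplacement(out, Matcher.quoteReplacement(replacement));
        }
        m.appendTail(out);
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(splitAfterSymbols("{\"a\":\"x,y\\\"z\",\"b\":1}"));
    }
}
```

    Because the string alternative consumes the whole literal first, commas, colons and brackets inside a text value are never seen by the second alternative.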

    3- if I have to take into account also single quotes, as this file is produced by my app but occasionally it could be manually edited by the user.

    Single quotes aren't allowed by the JSON specification, but many parsers support them because humans tend to use them anyway. Thus you might need to take them into account, and that makes no. 2 even more complex because now there might be an unescaped double quote inside a single-quoted text.

    5- It has to be used with the multi-line flag to work with multi-line text.

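    I'm not entirely sure about that, but adding the multi-line flag shouldn't hurt: it only makes ^ and $ match at line boundaries as well, and the output of JSONObject.toString() is a single line anyway. You could also add the flag to the expression itself, i.e. by prepending (?m).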
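    As a small illustration of what the flag changes, the sketch below replaces a comma at "end of line" with and without it. The pattern ,$ and the sample text are made up for the demonstration and are not the expression from the question.

```java
import java.util.regex.Pattern;

final class MultilineFlagDemo {
    // Without the flag, $ only matches at the end of the whole input.
    static String withoutFlag(String text) {
        return text.replaceAll(",$", ";");
    }

    // Inline form: (?m) at the start of the expression.
    static String withInlineFlag(String text) {
        return text.replaceAll("(?m),$", ";");
    }

    // Equivalent form using the Pattern.MULTILINE constant.
    static String withCompiledFlag(String text) {
        return Pattern.compile(",$", Pattern.MULTILINE).matcher(text).replaceAll(";");
    }

    public static void main(String[] args) {
        String text = "a,\nb,";
        System.out.println(withoutFlag(text));      // only the last comma is replaced
        System.out.println(withInlineFlag(text));   // both commas are replaced
        System.out.println(withCompiledFlag(text)); // same result as the inline flag
    }
}
```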

    6- It has to work with replaceAll().

    In its current form the regex would work with String#replaceAll() because it only matches the comma - the lookahead is used to determine a match but won't result in the wrong parts being replaced. The matches themselves might not be correct though, as described above.
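    For the single-line text from the question, which contains no escaped quotes, a call could look like the sketch below. The class and method names are illustrative only.

```java
final class ReplaceAllDemo {
    // The expression from the question, minus the surrounding slashes.
    static final String COMMA_REGEX = "(,)(?=(?:[^\"]|\"[^\"]*\")*$)";

    static String newlineAfterCommas(String json) {
        // "$1" re-inserts the captured comma; the lookahead is zero-width,
        // so the text it inspects is not part of the match and is left as-is.
        return json.replaceAll(COMMA_REGEX, "$1\n");
    }

    public static void main(String[] args) {
        String json = "{\"key1\":\"value1\",\"key2\":\"value2\",\"arrayKey\":"
                + "[{\"keyA\":\"valueA\",\"keyB\":\"valueB\",\"keyC\":[0,1,2]}]}";
        System.out.println(newlineAfterCommas(json));
    }
}
```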

    That being said, you should note that JSON is not a regular language and only regular languages are a perfect fit for regular expressions.

    Thus I'd recommend using a proper JSON parser (there are quite a lot out there) to parse the JSON into POJOs (might just be a bunch of generic JsonObject and JsonArray instances) and reformat that according to your needs.

    Here's an example of how Jackson could be used to accomplish that: https://kodejava.org/how-to-pretty-print-json-string-using-jackson/

    In fact, since you're already using JSONObject.toString() you probably don't need the parser itself but just a proper formatter (if you want/need to roll your own you could have a look at the org.json.JSONObject sources).
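    If you do roll your own, a single pass over the characters that tracks whether it is inside a string is usually enough. The following is only a sketch of that idea with a made-up class name and a fixed two-space indent; it assumes valid JSON with double-quoted strings, does no validation, and prints empty objects and arrays with a blank line inside.

```java
final class MiniJsonFormatter {
    static String format(String json) {
        StringBuilder out = new StringBuilder();
        int depth = 0;
        boolean inString = false;
        boolean escaped = false;
        for (int i = 0; i < json.length(); i++) {
            char c = json.charAt(i);
            if (inString) {
                // Inside a string everything is copied verbatim.
                out.append(c);
                if (escaped) {
                    escaped = false;          // character after a backslash
                } else if (c == '\\') {
                    escaped = true;           // next character is escaped
                } else if (c == '"') {
                    inString = false;         // unescaped quote closes the string
                }
                continue;
            }
            switch (c) {
                case '"':
                    inString = true;
                    out.append(c);
                    break;
                case '{':
                case '[':
                    out.append(c);
                    newline(out, ++depth);
                    break;
                case '}':
                case ']':
                    newline(out, --depth);
                    out.append(c);
                    break;
                case ',':
                    out.append(c);
                    newline(out, depth);
                    break;
                case ':':
                    out.append(": ");
                    break;
                default:
                    // Drop existing whitespace between tokens, keep everything else.
                    if (!Character.isWhitespace(c)) {
                        out.append(c);
                    }
            }
        }
        return out.toString();
    }

    private static void newline(StringBuilder out, int depth) {
        out.append('\n');
        for (int i = 0; i < depth; i++) {
            out.append("  ");
        }
    }

    public static void main(String[] args) {
        System.out.println(format("{\"a\":[1,2],\"k\":\"a,b:{c}\\\"d\"}"));
    }
}
```

    If indented output is acceptable, org.json also offers JSONObject.toString(int), which takes the number of spaces to indent by, so a hand-written formatter may not be needed at all.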