I'm using JsonCpp to parse JSON in C++.
e.g.
Json::Reader r;
std::stringstream ss;
ss << "{\"name\": \"sample\"}";
Json::Value v;
assert(r.parse(ss, v)); // OK
assert(v["name"] == "sample"); // OK
But my actual input is a whole stream of JSON messages, that may arrive in chunks of any size; all I can do is to get JsonCpp to try to parse my input, character by character, eating up full JSON messages as we discover them:
Json::Reader r;
std::string input = "{\"name\": \"sample\"}{\"name\": \"aardvark\"}";
for (size_t cursor = 0; cursor < input.size(); cursor++) {
std::stringstream ss;
ss << input.substr(0, cursor);
Json::Value v;
if (r.parse(ss, v)) {
std::cout << v["name"] << " ";
input.erase(0, cursor);
}
} // Output: sample aardvark
This is already a bit nasty, but it does get worse. I also need to be able to resync when part of an input is missing (for any reason).
Now it doesn't have to be lossless, but I want to prevent an input such as the following from potentially breaking the parser forever:
{"name": "samp{"name": "aardvark"}
Passing this input to JsonCpp will fail, but that problem won't go away as we receive more characters into the buffer; that second name
is simply invalid directly after the "
that precedes it; the buffer can never be completed to present valid JSON.
However, if I could be told that the fragment certainly becomes invalid as of the second n
character, I could drop everything in the buffer up to that point, and then simply wait for the next {
to consider the start of a new object, as a best-effort resync.
So, is there a way that I can ask JsonCpp to tell me whether an incomplete fragment of JSON has already guaranteed that the complete "object" will be syntactically invalid?
That is:
{"name": "sample"} Valid (Json::Reader::parse == true)
{"name": "sam Incomplete (Json::Reader::parse == false)
{"name": "sam"LOL Invalid (Json::Reader::parse == false)
I'd like to distinguish between the two fail states.
Can I use JsonCpp to achieve this, or am I going to have to write my own JSON "partial validator" by constructing a state machine that considers which characters are "valid" at each step through the input string? I'd rather not re-invent the wheel...
It certainly depends if you actually control the packets (and thus the producer), or not. If you do, the most simple way is to indicate the boundaries in a header:
+---+---+---+---+-----------------------
| 3 | 16|132|243|endofprevious"}{"name":...
+---+---+---+---+-----------------------
The header is simple:
and then comes the buffer itself.
Upon receiving such a packet, the following entries can be parsed:
previous + current[0:16]
current[16:132]
current[132:243]
And current[243:]
is saved for the next packet (though you can always attempt to parse it in case it's complete).
This way, the packets are auto-synchronizing, and there is no fuzzy detection, with all the failure cases it entails.
Note that there could be 0
boundaries in the packet. It simply implies that one object is big enough to span several packets, and you just need to accumulate for the moment.
I would recommend making the numbers representation "fixed" (for example, 4 bytes each) and settling on a byte order (that of your machine) to convert them into/from binary easily. I believe the overhead to be fairly minimal (4 bytes + 4 bytes per entry given that {"name":""}
is already 11 bytes).