I receive a single JSON file from a client which is not well-formed.
The client concatenates multiple JSON responses into one file:
{
object1
{
...
}
}
{
object2
{
...
}
}
...
When I parse it into a DataFrame in PySpark, I always get a count of one, which makes sense: it reads only the first root object and ignores the rest.
I need to handle this somehow, and I'm trying to figure out the best approach performance-wise.
Can the DataFrame reader cope with malformed JSON like this, or can I easily fix it in plain Python?
You can use the jq module to parse the data:
>>> data = open("tmp.json").read()
>>> data
'{"foo": 1}\n{"bar": 2}\n'
>>> import jq
>>> jq.compile(".").input_text(data).all()
[{'foo': 1}, {'bar': 2}]
What you have isn't really invalid or incorrect; it's just a stream of individual JSON objects rather than a single JSON value like an array.
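If you'd rather avoid an extra dependency, the standard library can handle such a stream too: `json.JSONDecoder.raw_decode` parses one value from a string and returns the offset where it stopped, so you can walk the stream value by value. A minimal sketch (the function name is just illustrative):

```python
import json

def parse_concatenated(text):
    """Split a stream of concatenated JSON values into a list.

    raw_decode parses one JSON value starting at a given index and
    returns (value, end_index), so we advance through the string
    until it is exhausted.
    """
    decoder = json.JSONDecoder()
    objects = []
    idx = 0
    while idx < len(text):
        # Skip whitespace (including newlines) between values.
        while idx < len(text) and text[idx].isspace():
            idx += 1
        if idx >= len(text):
            break
        obj, idx = decoder.raw_decode(text, idx)
        objects.append(obj)
    return objects

parse_concatenated('{"foo": 1}\n{"bar": 2}\n')
# [{'foo': 1}, {'bar': 2}]
```

The resulting list of dicts can then be re-serialized as a proper JSON array, or handed to Spark directly.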