python · json · pyspark

How best to handle badly concatenated JSON


I receive a single JSON file from a client, and it is not well formed.
The client concatenates multiple JSON responses into one file:

{
    object1
    {
        ...
    }
}   
{
    object2
    {
        ...
    }
}
...

When I parse it into a DataFrame in PySpark, I always get a count of only one root object. That count is technically correct, because the reader parses only the first object and ignores everything after it.
I need to handle this somehow, and I'm trying to figure out the best way to do it performance-wise.
Can the DataFrame reader handle bad JSON, or can I easily fix this with Python?


Solution

  • You can use the jq module to parse the data.

    >>> data = open("tmp.json").read()
    >>> data
    '{"foo": 1}\n{"bar": 2}\n'
    >>> import jq
    >>> jq.compile(".").input_text(data).all()
    [{'foo': 1}, {'bar': 2}]
    

    What you have isn't really invalid or incorrect; it's just a stream of individual JSON objects rather than a single JSON value like an array.
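
  • If you'd rather not add a dependency, you can split the same stream with the standard library alone: json.JSONDecoder.raw_decode parses one value from a string and returns the index where it stopped, so you can walk the text object by object. A minimal sketch of that approach (iter_json_stream is a hypothetical helper name, and the sample data matches the jq example above):

    import json

    def iter_json_stream(text):
        """Yield each top-level JSON value from a concatenated stream."""
        decoder = json.JSONDecoder()
        pos = 0
        while pos < len(text):
            # raw_decode raises on leading whitespace, so skip it first.
            while pos < len(text) and text[pos].isspace():
                pos += 1
            if pos >= len(text):
                break
            # raw_decode returns the parsed value and the index just past it.
            obj, pos = decoder.raw_decode(text, pos)
            yield obj

    print(list(iter_json_stream(open("tmp.json").read())))
    # [{'foo': 1}, {'bar': 2}]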
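
  • Once the stream is split, getting it into a PySpark DataFrame is straightforward: re-serialize the parsed objects one document per line (JSON Lines), which Spark's JSON reader understands natively. A minimal sketch, assuming the jq approach above, a local SparkSession, and the file name tmp.json from the example:

    import json
    import jq
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()

    # Parse the concatenated stream into a list of Python dicts with jq.
    records = jq.compile(".").input_text(open("tmp.json").read()).all()

    # One JSON document per line is the format Spark's JSON reader expects,
    # so it now sees one row per root object instead of stopping at the first.
    lines = spark.sparkContext.parallelize([json.dumps(r) for r in records])
    df = spark.read.json(lines)
    print(df.count())  # one row per root object, not 1

    Note that this parses the whole file on the driver before handing it to Spark, which should be fine for files of moderate size.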