
Read Json in Pyspark


I want to read a JSON file in PySpark, but the file is in this format (one object per line, without commas or square brackets):

{"id": 1, "name": "jhon"}
{"id": 2, "name": "bryan"}
{"id": 3, "name": "jane"}

Is there an easy way to read this JSON in PySpark?

I have already tried this code:

df = spark.read.option("multiline", "true").json("data.json")
df.write.parquet("data.parquet")

But it doesn't work: only the first record ends up in the Parquet file.

I just want to read this JSON file and save it as Parquet...


Solution

  • Only the first line appears when reading your file because the multiline option is set to "true", but in this file each line is its own complete JSON object (the JSON Lines format). If you set multiline to "false" (which is also the default), it will work as expected.

    df = spark.read.option("multiline", "false").json("data.json")
    df.show()
    df.write.parquet("data.parquet")
    

    If your JSON file had instead contained a JSON array, like

    [
    {"id": 1, "name": "jhon"},
    {"id": 2, "name": "bryan"},
    {"id": 3, "name": "jane"}
    ]
    

    or

    [
        {
            "id": 1, 
            "name": "jhon"
        },
        {
            "id": 2, 
            "name": "bryan"
        }
    ]
    

    then setting the multiline option to "true" would be the right choice, because each record spans multiple lines and the file must be parsed as one document.
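The distinction between the two layouts can be illustrated in plain Python, without Spark: a JSON Lines file must be parsed record by record, while a JSON array is parsed as a single document. This is only a sketch of the format difference; the sample strings below are hypothetical stand-ins for the file contents.

```python
import json

# JSON Lines layout: one complete JSON object per line
# (what Spark reads with multiline=false, the default).
json_lines = (
    '{"id": 1, "name": "jhon"}\n'
    '{"id": 2, "name": "bryan"}\n'
    '{"id": 3, "name": "jane"}'
)

# JSON array layout: a single document that spans several lines
# (what Spark reads with multiline=true).
json_array = """[
    {"id": 1, "name": "jhon"},
    {"id": 2, "name": "bryan"},
    {"id": 3, "name": "jane"}
]"""

# JSON Lines: parse each line independently.
records_from_lines = [json.loads(line) for line in json_lines.splitlines()]

# JSON array: parse the whole text as one value.
records_from_array = json.loads(json_array)

print(records_from_lines == records_from_array)  # True
```

Feeding the multi-line array through the line-by-line parser (or vice versa) fails for the same reason the wrong multiline setting fails in Spark: the parser's unit of work (a line vs. the whole file) has to match the layout of the data.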