
Read Json in Pyspark


I want to read a JSON file in PySpark, but the file is in this format (one object per line, without commas or square brackets):

{"id": 1, "name": "jhon"}
{"id": 2, "name": "bryan"}
{"id": 3, "name": "jane"}

Is there an easy way to read this JSON in PySpark?

I have already tried this code:

df = spark.read.option("multiline", "true").json("data.json")
df.write.parquet("data.parquet")

But it doesn't work: only the first record ends up in the Parquet file.

I just want to read this JSON file and save it as Parquet...


Solution

  • Only the first line appears when reading your file because the multiline option is set to "true", but in this file each line is its own complete JSON object (the JSON Lines format). If you set multiline to "false" (which is also the default), it will work as expected.

    df = spark.read.option("multiline", "false").json("data.json")
    df.show()
    df.write.parquet("data.parquet")
    

    If your JSON file had instead contained a JSON array, like

    [
    {"id": 1, "name": "jhon"},
    {"id": 2, "name": "bryan"},
    {"id": 3, "name": "jane"}
    ]
    

    or

    [
        {
            "id": 1, 
            "name": "jhon"
        },
        {
            "id": 2, 
            "name": "bryan"
        }
    ]
    

    then setting the multiline option to "true" would be the right choice, because each record spans multiple lines and the file must be parsed as one document.
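The distinction between the two layouts can be illustrated in plain Python, without Spark: a JSON Lines file must be parsed record by record, while a JSON array is parsed as a single document. This is only a sketch of the format difference; the sample strings below are hypothetical stand-ins for the file contents.

```python
import json

# JSON Lines layout: one complete JSON object per line
# (what Spark reads with multiline=false, the default).
json_lines = (
    '{"id": 1, "name": "jhon"}\n'
    '{"id": 2, "name": "bryan"}\n'
    '{"id": 3, "name": "jane"}'
)

# JSON array layout: a single document that spans several lines
# (what Spark reads with multiline=true).
json_array = """[
    {"id": 1, "name": "jhon"},
    {"id": 2, "name": "bryan"},
    {"id": 3, "name": "jane"}
]"""

# JSON Lines: parse each line independently.
records_from_lines = [json.loads(line) for line in json_lines.splitlines()]

# JSON array: parse the whole text as one value.
records_from_array = json.loads(json_array)

print(records_from_lines == records_from_array)  # True
```

Feeding the multi-line array through the line-by-line parser (or vice versa) fails for the same reason the wrong multiline setting fails in Spark: the parser's unit of work (a line vs. the whole file) has to match the layout of the data.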