
Spark dataframe requires json file as one object in one line?


I'm new to Spark and am trying to read a JSON file like the one below. I'm using Spark 2.3 and Scala 2.11 on Ubuntu 18.04 with Java 1.8:

cat my.json:

{ "Name":"A", "No_Of_Emp":1, "No_Of_Supervisors":2}
{ "Name":"B", "No_Of_Emp":2, "No_Of_Supervisors":3}
{ "Name":"C", "No_Of_Emp":13,"No_Of_Supervisors":6}

And my Scala code is:

import org.apache.spark.SparkConf
import org.apache.spark.sql.SparkSession

val dir = System.getProperty("user.dir")
val conf = new SparkConf().setAppName("spark sql")
  .set("spark.sql.warehouse.dir", dir)
  .setMaster("local[4]")
val spark = SparkSession.builder().config(conf).getOrCreate()
val df = spark.read.json("my.json")
df.show()
df.printSchema()
df.select("Name").show()

OK, everything works fine. But if I change the JSON file to a multi-line, standard JSON format:

[
    {
      "Name": "A",
      "No_Of_Emp": 1,
      "No_Of_Supervisors": 2
    },
    {
      "Name": "B",
      "No_Of_Emp": 2,
      "No_Of_Supervisors": 3
    },
    {
      "Name": "C",
      "No_Of_Emp": 13,
      "No_Of_Supervisors": 6
    }
]

Then the program reports an error:

+--------------------+
|     _corrupt_record|
+--------------------+
|                   [|
|                   {|
|        "Name": "A",|
|      "No_Of_Emp"...|
|      "No_Of_Supe...|
|                  },|
|                   {|
|        "Name": "B",|
|      "No_Of_Emp"...|
|      "No_Of_Supe...|
|                  },|
|                   {|
|        "Name": "C",|
|      "No_Of_Emp"...|
|      "No_Of_Supe...|
|                   }|
|                   ]|
+--------------------+

root
 |-- _corrupt_record: string (nullable = true)

Exception in thread "main" org.apache.spark.sql.AnalysisException: cannot resolve '`Name`' given input columns: [_corrupt_record];;
'Project ['Name]
+- Relation[_corrupt_record#0] json

I'd like to know why this happens. A non-standard JSON file without the enclosing [] works (one object per line), but a more standard, pretty-printed JSON file becomes a "corrupt record"?


Solution

  • From the official documentation, we can get some information relevant to your question:

    Spark SQL can automatically infer the schema of a JSON dataset and load it as a Dataset[Row]. This conversion can be done using SparkSession.read.json() on either a Dataset[String], or a JSON file. Note that the file that is offered as a json file is not a typical JSON file. Each line must contain a separate, self-contained valid JSON object. For more information, please see JSON Lines text format, also called newline-delimited JSON. For a regular multi-line JSON file, set the multiLine option to true.

    So if you want to read your multi-line data, set the multiLine option to true.

    Here is an example:

    import org.apache.spark.SparkConf
    import org.apache.spark.sql.SparkSession

    val dir = System.getProperty("user.dir")
    val conf = new SparkConf().setAppName("spark sql")
      .set("spark.sql.warehouse.dir", dir)
      .setMaster("local[*]")

    val spark = SparkSession.builder().config(conf).getOrCreate()

    // multiLine tells Spark to parse the whole file as one JSON document
    val df = spark.read.option("multiLine", true).json("my.json")
    df.show()
    df.printSchema()
    df.select("Name").show()