
How to read a multi-line JSON file in Spark using Scala?


I want to read a JSON file in the following format:

{
  "titlename": "periodic",
  "atom": [
    {
      "usage": "neutron",
      "dailydata": [
        {
          "utcacquisitiontime": "2017-03-27T22:00:00Z",
          "datatimezone": "+02:00",
          "intervalvalue": 28128,
          "intervaltime": 15
        },
        {
          "utcacquisitiontime": "2017-03-27T22:15:00Z",
          "datatimezone": "+02:00",
          "intervalvalue": 25687,
          "intervaltime": 15
        }
      ]
    }
  ]
}

I am reading the file like this:

sqlContext.read.json("user/files_fold/testing-data.json").printSchema

But I am not getting the desired result:

root
  |-- _corrupt_record: string (nullable = true)

Please help me with this.


Solution

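  • By default, Spark's JSON reader expects one complete JSON object per line, so a pretty-printed file that spans several lines ends up as _corrupt_record. I suggest reading the file with wholeTextFiles and collapsing its content into single-line JSON before parsing it.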

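    // wholeTextFiles yields (path, content) pairs; keep the content and strip the newlines
    val json = sc.wholeTextFiles("/user/files_fold/testing-data.json").
      map { case (_, content) => content.replace("\n", "").trim }

    // parse the RDD of single-line JSON strings
    val df = sqlContext.read.json(json)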
    

    You should then get a valid dataframe:

    +--------------------------------------------------------------------------------------------------------+---------+
    |atom                                                                                                    |titlename|
    +--------------------------------------------------------------------------------------------------------+---------+
    |[[WrappedArray([+02:00,15,28128,2017-03-27T22:00:00Z], [+02:00,15,25687,2017-03-27T22:15:00Z]),neutron]]|periodic |
    +--------------------------------------------------------------------------------------------------------+---------+
    

    And a valid schema:

    root
     |-- atom: array (nullable = true)
     |    |-- element: struct (containsNull = true)
     |    |    |-- dailydata: array (nullable = true)
     |    |    |    |-- element: struct (containsNull = true)
     |    |    |    |    |-- datatimezone: string (nullable = true)
     |    |    |    |    |-- intervaltime: long (nullable = true)
     |    |    |    |    |-- intervalvalue: long (nullable = true)
     |    |    |    |    |-- utcacquisitiontime: string (nullable = true)
     |    |    |-- usage: string (nullable = true)
     |-- titlename: string (nullable = true)
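
As an alternative, on Spark 2.2 or later the JSON reader has a multiLine option that parses a record spanning several lines directly, so the manual newline stripping should not be needed. A minimal sketch, assuming a SparkSession is used instead of sqlContext; the object name ReadMultiLineJson, the app name, and the local master are illustrative assumptions and not part of the original answer:

```scala
import org.apache.spark.sql.SparkSession

// Hypothetical standalone entry point; in spark-shell the `spark` session already exists.
object ReadMultiLineJson {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("read-multiline-json") // illustrative name
      .master("local[*]")             // assumption: local run; omit when submitting to a cluster
      .getOrCreate()

    // multiLine = true tells the reader that one JSON record may span several lines
    val df = spark.read
      .option("multiLine", true)
      .json("/user/files_fold/testing-data.json")

    df.printSchema()
    df.show(false) // false = do not truncate long column values

    spark.stop()
  }
}
```

With this option each file is parsed as a whole, so it suits files that each hold one JSON document (or one top-level array) rather than very large line-delimited inputs.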