json scala apache-spark spark-shell jsonlines

Not able to create a DataFrame from a multi-line JSON or JSONL string using Spark


I have been trying to build a DataFrame from a JSONL string. The DataFrame is created, but only a single row is read and the others are ignored.
Here is what I tried in spark-shell:

// Example string containing several JSON records.
val jsonEx = "{\"name\":\"James\"}{\"name\":\"John\"}{\"name\":\"Jane\"}"

// Schema for the records.
import org.apache.spark.sql.types.{StringType, StructType}
val sch = new StructType().add("name", StringType)

val ds = Seq(jsonEx).toDS()

// 1st attempt -- using multiline and spark.json
spark.read.option("multiLine", true).schema(sch).json(ds).show
+-----+
| name|
+-----+
|James|
+-----+

// 2nd attempt -- using from_json
ds.withColumn("json", from_json(col("value"), sch)).select("json.*").show
+-----+
| name|
+-----+
|James|
+-----+

// 3rd attempt -- using from_json in a slightly different way
ds.select(from_json(col("value"), sch) as "json").select("json.*").show
+-----+
| name|
+-----+
|James|
+-----+

I also tried changing the string to
val jsonEx = "{\"name\":\"James\"}\n{\"name\":\"John\"}\n{\"name\":\"Jane\"}"
and
val jsonEx = "{\"name\":\"James\"}\n\r{\"name\":\"John\"}\n\r{\"name\":\"Jane\"}"

But the result was the same.

Does anyone know what I am missing here?

In case someone is wondering why I am reading from a string instead of a file: I have a JSONL config file inside the resources path. When I try to read it using getClass.getResource, Scala gives me an error, while getClass.getResourceAsStream works and I am able to read the data.

val configPath = "/com/org/example/data_sources_config.jsonl"
for(line <- Source.fromInputStream(getClass.getResourceAsStream(configPath)).getLines) { print(line)}
{"name":"james"} ...

But when I do:
for(line <- Source.fromFile(getClass.getResource(configPath).getPath).getLines) { print(line)}
java.io.FileNotFoundException: file:/Users/sachindoiphode/workspace/dap-links-datalake-jobs/target/dap-links-datalake-jobs-0.0.65.jar!/com/org/example/data_sources_config.jsonl (No such file or directory)

Solution

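  • Even if jsonEx contains several JSON records, Seq(jsonEx).toDS() still holds it as one element, which is why only the first record comes back. You need to extract the rows out of it so that each record is its own element. For the newline-separated version of the string: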

    val ds = jsonEx.split("\n").toSeq.toDS
    

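    To read the JSONL file from the resources path, read it as a stream. getClass.getResource(path).getPath does not point to a real file once the resource is packaged inside a jar, which is the FileNotFoundException shown in the question. Something like this:

    val path = "/com/org/example/data_sources_config.jsonl"
    val source = Source.fromInputStream(getClass.getResourceAsStream(path))
    // Keep the newline separators so the records can be split apart again.
    val content = try source.getLines.mkString("\n") finally source.close()
    

    Then do content.split("\n").toSeq.toDF if you want to create a DataFrame out of it. That gives a single string column named value; to get the parsed columns, use toDS instead and pass the result to spark.read.schema(sch).json(...) as in the question.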
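
The splitting step described above can be sketched as a small helper. This is only an illustration under some assumptions: the object name JsonlRecords and the methods fromString and fromStream are hypothetical (they are not part of Spark or of the question's code), and the input is assumed to hold exactly one JSON record per line. It does not handle records concatenated without any line break, as in the first jsonEx from the question.

```scala
import java.io.InputStream
import scala.io.Source

// Hypothetical helper (not a Spark API): turns JSONL text into one string per record.
object JsonlRecords {

  // Split a JSONL string on \n or \r\n, trim stray whitespace such as a
  // leftover \r, and drop blank lines.
  def fromString(text: String): Seq[String] =
    text.split("\\r?\\n").iterator.map(_.trim).filter(_.nonEmpty).toList

  // Read records from any InputStream, for example the one returned by
  // getClass.getResourceAsStream(configPath), which also works inside a jar.
  def fromStream(in: InputStream): Seq[String] = {
    val source = Source.fromInputStream(in, "UTF-8")
    try source.getLines().map(_.trim).filter(_.nonEmpty).toList
    finally source.close()
  }
}

// In spark-shell the records could then be parsed with the schema from the question:
//   val records = JsonlRecords.fromStream(getClass.getResourceAsStream(configPath))
//   spark.read.schema(sch).json(records.toDS()).show()
```

Materializing the lines into a List before closing the source matters here, because getLines is lazy and would otherwise be read after the stream is closed.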