java json hadoop apache-spark apache-spark-dataset

How to parse a multi-line JSON file into a Dataset in Apache Spark (Java)


Is there any way to parse a multi-line JSON file using a Dataset? Here is my sample code:

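import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class MultiLineJsonExample {
    public static void main(String[] args) {
        // Create the Spark session
        SparkSession spark = SparkSession.builder()
                .appName("Java Spark SQL basic example")
                .config("spark.some.config.option", "some-value")
                .getOrCreate();

        // Read the JSON file; by default Spark expects one JSON object per line
        Dataset<Row> df = spark.read().json("D:/sparktestio/input.json");
        df.show();
    }
}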

It works perfectly if the JSON is on a single line, but I need it to work for JSON that spans multiple lines.

My JSON file:

{
  "name": "superman",
  "age": "unknown",
  "height": "6.2",
  "weight": "flexible"
}

Solution

  • The last time I checked the Spark SQL docs, this note stood out:

    Note that the file that is offered as a json file is not a typical JSON file. Each line must contain a separate, self-contained valid JSON object. As a consequence, a regular multi-line JSON file will most often fail.

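    I've addressed this in the past by loading the file with the SparkContext wholeTextFiles method, which produces a pair RDD (JavaPairRDD in the Java API) of file path and file content. Because each file arrives as a single string, the JSON inside it can span any number of lines.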

    See the complete example in the "Spark SQL JSON Example Tutorial Part 2" section of this page: https://www.supergloo.com/fieldnotes/spark-sql-json-examples/
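
For reference, here is a minimal Java sketch of the wholeTextFiles approach described above. It is an illustration rather than the code from the linked tutorial. Assumptions: Spark 2.2 or later (for the json(Dataset<String>) overload), a local master for the demo run, and the hypothetical names WholeFileJsonReader and readWholeFileJson, which are not part of any Spark API.

```java
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Encoders;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

// Hypothetical helper class, named for illustration only.
public class WholeFileJsonReader {

    // Treats every file under `path` as one JSON document, regardless of line breaks.
    public static Dataset<Row> readWholeFileJson(SparkSession spark, String path) {
        JavaSparkContext jsc = JavaSparkContext.fromSparkContext(spark.sparkContext());

        // wholeTextFiles yields (file path, entire file content) pairs;
        // only the content is needed here.
        JavaRDD<String> documents = jsc.wholeTextFiles(path).values();

        // Each string is handed to the JSON reader as a single record.
        Dataset<String> jsonStrings = spark.createDataset(documents.rdd(), Encoders.STRING());
        return spark.read().json(jsonStrings);
    }

    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
                .appName("Multi-line JSON example")
                .master("local[*]") // assumption: local run for illustration
                .getOrCreate();

        // Path taken from the question
        Dataset<Row> df = readWholeFileJson(spark, "D:/sparktestio/input.json");
        df.show();

        spark.stop();
    }
}
```

Note that wholeTextFiles loads each file into memory as a single string, so it is best suited to small files. On Spark 2.2 and later, the built-in reader also accepts spark.read().option("multiLine", true).json(path), which usually makes this workaround unnecessary.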