Tags: java, hadoop, apache-spark, avro, emr

Transform JSON into Parquet using EMR/Spark


I have a huge number of JSON files that I need to transform into Parquet. They look something like this:

{
  "foo": "bar",
  "props": {
    "prop1": "val1",
    "prop2": "val2"
  }
}

And I need to transform them into a Parquet file with the following structure (nested properties are made top-level and prefixed with _):

foo=bar
_prop1=val1
_prop2=val2

Now here's the catch: not all of the JSON documents have the same properties. So, if doc1 has prop1 and prop2, but doc2 has prop3, the final Parquet file must contain all three properties (some of them will be null for some of the records).
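
For example, given two documents like these (values made up for illustration):

{ "foo": "bar", "props": { "prop1": "val1", "prop2": "val2" } }
{ "foo": "baz", "props": { "prop3": "val3" } }

the resulting Parquet rows should look like:

foo=bar  _prop1=val1  _prop2=val2  _prop3=null
foo=baz  _prop1=null  _prop2=null  _prop3=val3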

I understand that Parquet needs a schema up front, so my current plan is:

  • Traverse all the JSON files
  • Infer a schema per document (using Kite, like this)
  • Merge all the schemas
  • Start writing the Parquet

This approach strikes me as very complicated, slow and error-prone. I'm wondering if there's a better way to achieve this using Spark.


Solution

  • It turns out Spark already does this for you: when it reads JSON documents without an explicit schema, it infers one and merges the schemas across all documents. So in my case, something like this works:

    import org.apache.spark.rdd.RDD

    // sparkContext and sqlContext are assumed to already be in scope
    val flattenedJson: RDD[String] = sparkContext
      .textFile("/file")               // read the raw JSON documents as lines of text
      .map(/* parse/flatten json */)   // hoist the nested properties to the top level

    sqlContext
      .read
      .json(flattenedJson)             // schema is inferred and merged across documents
      .write
      .parquet("destination")
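
  • The /* parse/flatten json */ step is left open above. As a minimal sketch (not part of the original answer), one way to fill it in is a Jackson-based helper like the hypothetical flatten below, which assumes the nested object is always called props as in the example; plugging it in as .map(flatten) leaves the read/infer/write part unchanged, and constructing the ObjectMapper once per partition (mapPartitions) would be cheaper for large inputs:

    import com.fasterxml.jackson.databind.ObjectMapper
    import com.fasterxml.jackson.databind.node.ObjectNode
    import scala.collection.JavaConverters._

    // Hypothetical helper for the .map(...) step above: hoists every field of the
    // nested "props" object to the top level with a leading underscore, drops
    // "props", and re-serializes the document to a JSON string.
    def flatten(line: String): String = {
      val mapper = new ObjectMapper()
      val root = mapper.readTree(line).asInstanceOf[ObjectNode]
      Option(root.get("props")).foreach { props =>
        props.fields().asScala.foreach { entry =>
          root.set("_" + entry.getKey, entry.getValue)
        }
        root.remove("props")
      }
      mapper.writeValueAsString(root)
    }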