I have a huge amount of JSON files that I need to transform into Parquet. They look something like this:
{
  "foo": "bar",
  "props": {
    "prop1": "val1",
    "prop2": "val2"
  }
}
And I need to transform them into a Parquet file whose structure is this (nested properties are made top-level and get _ as a prefix):
foo=bar
_prop1=val1
_prop2=val2
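For clarity, the flattening I have in mind is roughly this. It's just a sketch using json4s; the flatten helper is hypothetical and assumes only the single level of nesting shown above:

import org.json4s._
import org.json4s.jackson.JsonMethods._

// Lift every field of the nested "props" object to the top level with a "_"
// prefix, leaving all other fields untouched.
def flatten(doc: String): String = {
  val JObject(fields) = parse(doc)
  val flattened = fields.flatMap {
    case ("props", JObject(props)) => props.map { case (name, value) => ("_" + name, value) }
    case other                     => List(other)
  }
  compact(render(JObject(flattened)))
}

// flatten("""{"foo":"bar","props":{"prop1":"val1","prop2":"val2"}}""")
// => {"foo":"bar","_prop1":"val1","_prop2":"val2"}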
Now here's the catch: not all of the JSON documents have the same properties. So, if doc1 has prop1 and prop2, but doc2 has prop3, the final Parquet file must contain all three properties (some of them will be null for some of the records).
I understand that Parquet needs a schema up front, so my current plan is:
1. Read the JSON files
2. Infer a schema per document (using Kite, like this)
3. Merge the per-document schemas
4. Write the Parquet file
This approach strikes me as very complicated, slow and error-prone. I'm wondering if there's a better way to achieve this using Spark.
It turns out Spark already does this for you: when it reads JSON documents and you do not specify a schema, it infers a schema for each document and merges them. So in my case, something like this works:
import org.apache.spark.rdd.RDD

// Read each file as a single (filename, contents) pair and flatten the nested JSON
val flattenedJson: RDD[String] = sparkContext
  .wholeTextFiles("/file")
  .map { case (_, json) => /* parse/flatten json */ json }

// With no schema given, Spark infers one per document and merges them
sqlContext
  .read
  .json(flattenedJson)
  .write
  .parquet("destination")