apache-spark, apache-spark-sql, parquet

How to have different schemas within parquet partitions


I have JSON files that I read into a DataFrame. The JSON has a struct field messages whose fields depend on name, as shown below.

Json1
{
   "ts":"2020-05-17T00:00:03Z",
   "name":"foo",
   "messages":[
      {
         "a":1810,
         "b":"hello",
         "c":390
      }
   ]
}

Json2
{
   "ts":"2020-05-17T00:00:03Z",
   "name":"bar",
   "messages":[
      {
         "b":"my",
         "d":"world"
      }
   ]
}

When I read the JSON files into a DataFrame, I get the following merged schema:

root
 |-- ts: string (nullable = true)
 |-- name: string (nullable = true)
 |-- messages: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- a: long (nullable = true)
 |    |    |-- b: string (nullable = true)
 |    |    |-- c: long (nullable = true)
 |    |    |-- d: string (nullable = true)

This is fine. Now, when I save to a Parquet file partitioned by name, how can I have different schemas in the foo and bar partitions?

path/name=foo
root
 |-- ts: string (nullable = true)
 |-- name: string (nullable = true)
 |-- messages: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- a: long (nullable = true)
 |    |    |-- b: string (nullable = true)
 |    |    |-- c: long (nullable = true)

path/name=bar
root
 |-- ts: string (nullable = true)
 |-- name: string (nullable = true)
 |-- messages: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- b: string (nullable = true)
 |    |    |-- d: string (nullable = true)

I am fine with getting a schema containing all the fields of foo and bar when I read from the root path. But when I read from path/name=foo, I expect just the foo schema.
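For reference, this is roughly how I read and write the data (PySpark, placeholder paths):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Spark infers one merged schema (a, b, c, d inside messages) across all the JSON files.
df = spark.read.json("path/to/json/input")

# Save as Parquet partitioned by name.
df.write.partitionBy("name").parquet("path")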


Solution

  1. Partitioning & Storing as a Parquet file:

    If you save in Parquet format, then when reading path/name=foo specify a schema containing only the required fields (a, b, c); Spark will then load only those fields.

    • If you don't specify a schema, all fields (a, b, c, d) will be included in the DataFrame, because every partition's Parquet files are written with the full DataFrame schema.

    EX:

    schema = ...  # StructType covering only the foo fields; see the sketch below
    spark.read.schema(schema).parquet("path/name=foo").printSchema()
    
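    A minimal sketch in PySpark, assuming an existing spark session and using the field types from the question (a, c as long; b as string). The name foo_schema is just illustrative; the partition column name lives in the directory path, not in the files, so it is left out of the schema:

    from pyspark.sql.types import (StructType, StructField, StringType,
                                   LongType, ArrayType)

    # Only the fields that belong to name=foo.
    foo_schema = StructType([
        StructField("ts", StringType(), True),
        StructField("messages", ArrayType(StructType([
            StructField("a", LongType(), True),
            StructField("b", StringType(), True),
            StructField("c", LongType(), True),
        ])), True),
    ])

    # Spark reads only the requested columns; d is never loaded.
    spark.read.schema(foo_schema).parquet("path/name=foo").printSchema()

    If you also need name as a column while reading a single partition directory, pass option("basePath", "path") so Spark recognizes name=foo as a partition.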

    2. Partitioning & Storing as a JSON/CSV file:

    Spark's JSON writer omits null fields, so the d column is not written into the path/name=foo files (and a, c are not written into name=bar). When we read only the name=foo directory, we don't get d included in the data.

    EX:

    spark.read.json("path/name=foo").printSchema()
    spark.read.csv("path/name=foo").printSchema()
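    A sketch of the JSON route, assuming df is the DataFrame read from the original JSON input and path_json is a placeholder output directory. Spark's JSON writer drops null fields by default, which is what keeps the per-partition schemas distinct:

    # Write partitioned JSON; null struct fields (d for foo, a and c for bar) are omitted.
    df.write.partitionBy("name").json("path_json")

    # Reading a single partition infers only the fields that were actually written.
    spark.read.json("path_json/name=foo").printSchema()   # ts, messages: a, b, c
    spark.read.json("path_json/name=bar").printSchema()   # ts, messages: b, d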