I have JSON files that I read into a DataFrame. Each JSON document can have a messages field (an array of structs) whose fields are specific to name, as shown below.
Json1
{
"ts":"2020-05-17T00:00:03Z",
"name":"foo",
"messages":[
{
"a":1810,
"b":"hello",
"c":390
}
]
}
Json2
{
"ts":"2020-05-17T00:00:03Z",
"name":"bar",
"messages":[
{
"b":"my",
"d":"world"
}
]
}
When I read the JSON files into a DataFrame, I get the schema below.
root
|-- ts: string (nullable = true)
|-- name: string (nullable = true)
|-- messages: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- a: long (nullable = true)
| | |-- b: string (nullable = true)
| | |-- c: long (nullable = true)
| | |-- d: string (nullable = true)
This is fine. Now, when I save to Parquet files partitioned by name, how can I have different schemas in the foo and bar partitions?
path/name=foo
root
|-- ts: string (nullable = true)
|-- name: string (nullable = true)
|-- messages: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- a: long (nullable = true)
| | |-- b: string (nullable = true)
| | |-- c: long (nullable = true)
path/name=bar
root
|-- ts: string (nullable = true)
|-- name: string (nullable = true)
|-- messages: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- b: string (nullable = true)
| | |-- d: string (nullable = true)
I am fine with getting a schema that has all fields of foo and bar when I read from the root path. But when I read from path/name=foo, I expect just the foo schema.
1. Partitioning & Storing as Parquet files:
If you save in Parquet format, then when reading path/name=foo, specify a schema that includes only the required fields (a, b, c). Spark then loads only those fields.
If you don't specify a schema, all fields (a, b, c, d) are included in the DataFrame.
EX:
schema = ...  # define a StructType with only the fields you need
spark.read.schema(schema).parquet("path/name=foo").printSchema()
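To make the explicit-schema read concrete, here is a minimal PySpark sketch. It assumes the data was written under a root directory called path (a placeholder), and it passes the schema as a DDL string rather than a StructType, which spark.read.schema also accepts. The helper names messages_schema_ddl and read_foo are hypothetical, and the column types are taken from the printed schema in the question (long becomes BIGINT).

```python
def messages_schema_ddl(fields):
    """Build a DDL schema string that keeps only the given `messages` struct fields.

    `fields` is a list of (name, sql_type) pairs. This helper is hypothetical,
    written only for this example.
    """
    struct = ", ".join(f"{name}: {dtype}" for name, dtype in fields)
    return f"ts STRING, messages ARRAY<STRUCT<{struct}>>"


# Fields that belong to name=foo, as listed in the question.
FOO_FIELDS = [("a", "BIGINT"), ("b", "STRING"), ("c", "BIGINT")]


def read_foo(spark, root="path"):
    """Read only the foo partition, restricted to the foo fields.

    `spark` is an existing SparkSession; `root` is a placeholder for the
    directory the partitioned Parquet data was written to.
    """
    schema = messages_schema_ddl(FOO_FIELDS)
    return spark.read.schema(schema).parquet(f"{root}/name=foo")


# Usage, given an active SparkSession named `spark`:
#   read_foo(spark).printSchema()
```

The schema here leaves out name on purpose: partition columns live in the directory name, not in the data files, so reading path/name=foo directly does not return a name column unless the basePath option is set to the root path.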
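2. Partitioning & Storing as JSON files:
By default, Spark's JSON writer omits null fields, so the d field (always null for foo rows) is not written into the path/name=foo files. When we read only the name=foo directory, d is not included in the inferred schema.
CSV does not work for this data as-is, because Spark's CSV writer does not support array or struct columns such as messages.
EX:
spark.read.json("path/name=foo").printSchema()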
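A rough PySpark sketch of the JSON route, under the assumption that df is the DataFrame from the question and root is a placeholder output directory. The function names write_json_by_name and read_partition are hypothetical. It relies on the JSON writer's default behavior of leaving out null fields, so a field that is null for every foo row (here d) should not show up when the foo directory is read back and its schema is inferred.

```python
def write_json_by_name(df, root):
    """Write `df` as JSON files, one sub-directory per distinct `name` value.

    `df` is the DataFrame read from the source JSON; `root` is a placeholder
    output directory. Null fields are left out of the written JSON by default.
    """
    df.write.partitionBy("name").json(root)


def read_partition(spark, root, name):
    """Read back a single partition directory; the schema is inferred
    only from the fields actually present in that directory's files."""
    return spark.read.json(f"{root}/name={name}")


# Usage, given an active SparkSession `spark` and the DataFrame `df`:
#   write_json_by_name(df, "path")
#   read_partition(spark, "path", "foo").printSchema()
```

The trade-off is that JSON loses Parquet's columnar storage and embedded schema, so every read of a partition has to infer the schema from the files again.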