Tags: apache-spark, parquet, databricks, azure-databricks

Can we have different schema per row-group in the same parquet file?


Can we have a different schema per row group while creating a parquet file? In that case the footer would contain the union of all schemas across all row groups, but each row group's schema would differ. Is this a recognized parquet format? Does the parquet specification clearly state that the schema cannot change per row group within the same file?

The official specification isn't very specific about this, but Spark fails to read files written this way.

I tried writing such a file and reading it with spark.read.parquet, and I get the following error:

// this line works fine and shows the schema from the footer,
// which is the unioned schema of all the row groups
val df = spark.read.option("mergeSchema", "true").parquet("abc.parquet")

// but when I try to do df.show() it throws an error
df.show()

org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 10.0 failed 4 times, most recent failure: Lost task 0.3 in stage 10.0 (TID 86, 10.139.64.6, executor 0): java.lang.IllegalArgumentException: [Visibility_value_string] optional binary Visibility_value_string (UTF8) is not in the store: .....

The spec only says that the schema in every row group must contain columns in the same order as in the FileMetaData; it doesn't say that each row group must contain all of the columns. I interpret that as allowing additional columns in subsequent row groups, for example:

row group 1 -> col1, col2
row group 2 -> col1, col2, col3
row group 3 -> col1, col2, col3, col4
file metadata -> col1, col2, col3, col4

Is this an acceptable parquet format? If not, why not?


Solution

  • Individual files need to be internally consistent, but you can have different, "compatible" schemas across multiple files.
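
    For example, here is a minimal sketch (assuming a Spark shell session; the paths under /tmp/events are hypothetical) of two files that are each internally consistent but have different, compatible schemas. With mergeSchema enabled, Spark unions the footer schemas and fills the missing column with nulls:

    // each file below is internally consistent; the second one just adds an "active" column
    val part1 = Seq((1, "alice"), (2, "bob")).toDF("id", "name")
    val part2 = Seq((3, "carol", true), (4, "dave", false)).toDF("id", "name", "active")

    part1.write.parquet("/tmp/events/part1")
    part2.write.parquet("/tmp/events/part2")

    // mergeSchema unions the footer schemas of all the files being read;
    // rows from part1 simply come back with null in the "active" column
    val merged = spark.read
      .option("mergeSchema", "true")
      .parquet("/tmp/events/part1", "/tmp/events/part2")
    merged.printSchema()
    merged.show()

    This works because every row group inside each individual file matches that file's own footer schema; the schema differences exist only between files, which is what mergeSchema is designed to reconcile.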