With Apache Spark we can partition a dataframe into separate files when saving into Parquet format.
In the way Parquet files are written, each partition contains multiple row groups each of include column statistics pertaining to each group (e.g., min/max values, as well as number of NULL
values).
Now, it would seem ideal in some situations to organize the Parquet file such that related data appears together in one or more row groups. This would be a secondary level of partitioning within each partition file (which constitutes the first level).
This is possible using for example pyarrow, but how can we do this with a distributed SQL engine such as Spark?
Besides partitioning you can order your data to group related data together in a limited set of partitions. Statement from Databricks:
Z-Ordering is a technique to colocate related information in the same set of files
(
df
.write.option("header", True)
.orderBy(df.col_1.desc())
.partitionBy("col_2")
)