Is there an easy way of identifying the variable that was used to partition a parquet dataset?
As an example, below I create a toy parquet dataset from the mtcars dataset.
# Load library
library(arrow)
# Write data to parquet
mtcars |> write_dataset("~/boop", partitioning = "cyl")
One approach to determining the partitioning variable(s) could be to list the files that make up the dataset, like so:
# Open dataset & see files that are part of parquet
open_dataset("~/boop")$files
# [1] "XXXXX/boop/cyl=4/part-0.parquet" "XXXXX/boop/cyl=6/part-0.parquet"
# [3] "XXXXX/boop/cyl=8/part-0.parquet"
Here, I can see that cyl is the partitioning variable, but I would need to parse that out of the file paths, and with several partitioning variables this might get a smidge involved.
Is there a simple way of determining the partitioning variable? For example, is there a metadata variable that records this information?
Until someone suggests a better solution, this seems to work:
# Load library
library(arrow)
# Write data to parquet
mtcars |> write_dataset("~/boop", partitioning = c("cyl", "gear"))
# Files in parquet
pq_files <- open_dataset("~/boop")$files
# Extract partition names assuming */partition_name=value/* format
regmatches(pq_files, gregexpr("(?<=/)[^/]*(?==)", pq_files, perl = TRUE)) |> unlist() |> unique()
# [1] "cyl" "gear"
As suggested in the question, I list the files in the dataset and then use a regex to find the text sandwiched between / and =, which should correspond to the partition names.
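If the regex feels fragile, the same idea can be sketched in base R by splitting each path into its directory segments and keeping the key of every key=value segment. The helper name partition_names and the example paths are made up for illustration and are not part of arrow. The sketch assumes Hive-style key=value directories and would also pick up any parent directory whose name happens to contain an =.

```r
# Hypothetical helper (not part of arrow): pull Hive-style partition keys
# out of file paths of the form .../key=value/.../part-0.parquet
partition_names <- function(paths) {
  # Drop the file name so only directory segments are inspected
  dirs <- dirname(paths)
  # Split each directory path into its individual segments
  segments <- unlist(strsplit(dirs, "/", fixed = TRUE))
  # Keep only segments that look like key=value
  hive <- grep("^[^=]+=", segments, value = TRUE)
  # Strip everything from the first "=" onward, leaving the key;
  # unique() keeps the keys in order of first appearance
  unique(sub("=.*$", "", hive))
}

# Illustrative paths, shaped like the output of open_dataset(...)$files
paths <- c(
  "XXXXX/boop/cyl=4/gear=3/part-0.parquet",
  "XXXXX/boop/cyl=6/gear=4/part-0.parquet"
)
partition_names(paths)
# [1] "cyl" "gear"
```

With a real dataset, the call would presumably be partition_names(open_dataset("~/boop")$files). Because the key is cut at the first =, a partition value that itself contains an = does not leak into the name.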