r, parquet, apache-arrow

Identify partitioning variable in parquet file


Is there an easy way of identifying the variable that was used to partition a parquet dataset?


As an example, below I create a toy Parquet dataset from the mtcars data.

# Load library
library(arrow)

# Write data to parquet
mtcars |> write_dataset("~/boop", partitioning = "cyl")

One approach to determining the partitioning variable(s) is to list the files that make up the dataset, like so:

# Open dataset & see files that are part of parquet
open_dataset("~/boop")$files

# [1] "XXXXX/boop/cyl=4/part-0.parquet" "XXXXX/boop/cyl=6/part-0.parquet"
# [3] "XXXXX/boop/cyl=8/part-0.parquet"

Here, I can see that cyl is the partitioning variable, but I would need to parse it out of the paths, and with several partitioning variables that could get a bit involved.

Is there a simple way of determining the partitioning variable? For example, is there a metadata variable that records this information?


Solution

  • Until someone suggests a better solution, this seems to work:

    # Load library
    library(arrow)
    
    # Write data to parquet
    mtcars |> write_dataset("~/boop", partitioning = c("cyl", "gear"))
    
    # Files in parquet
    pq_files <- open_dataset("~/boop")$files
    
    # Extract partition names, assuming the */partition_name=value/* format
    regmatches(pq_files, gregexpr("(?<=/)[^/]*(?==)", pq_files, perl = TRUE)) |> unlist() |> unique()
    # [1] "cyl"  "gear"
    

    As suggested in the question, I list the files in the dataset and then use a regular expression to pull out the text sandwiched between / and =, which should correspond to the partition names.
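The same idea can also be written without a lookaround regex, which may be easier to read and adapt. The sketch below is only an illustration under the same assumption as above (Hive-style `name=value` directory components); `partition_vars` is a hypothetical helper name, not part of arrow, and the example paths are made up to mimic the output of `open_dataset(...)$files`.

```r
# Hypothetical helper (not part of arrow): extract Hive-style partition
# names from a character vector of dataset file paths.
partition_vars <- function(files) {
  # Drop the file name so that only directory components are inspected
  dirs <- dirname(files)
  # Split each directory path into its components
  parts <- unlist(strsplit(dirs, "/", fixed = TRUE))
  # Keep only components that look like "name=value"
  parts <- parts[grepl("^[^=]+=", parts)]
  # Strip everything from the first "=" onward, leaving the name
  unique(sub("=.*$", "", parts))
}

# Made-up paths in the same layout that write_dataset() produced above
files <- c(
  "boop/cyl=4/gear=3/part-0.parquet",
  "boop/cyl=4/gear=4/part-0.parquet",
  "boop/cyl=6/gear=3/part-0.parquet"
)

partition_vars(files)
# [1] "cyl"  "gear"
```

With a real dataset, the call would be `partition_vars(open_dataset("~/boop")$files)`. Like the regex version, this only works when the dataset was written with Hive-style directory names; it would also pick up any unrelated parent directory whose name happens to contain `=`.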