
Read only parquet column names in R


I want to get only the column names from a partitioned parquet dataset using the arrow package in R, ideally as a character vector. I can do this with collect, but on larger datasets with multiple partitions and multiple files it takes longer than expected. Here is an example of what I have and what I am hoping to achieve.

Create a parquet dataset with a partition (some may have multiple partitions)

arrow::write_dataset(mtcars, "C:/Data/parquet/mtcars", format = "parquet", partitioning = c("cyl"))

Current way to get parquet column names

library(dplyr)  # provides the %>% pipe used below
colnames(arrow::open_dataset(sources = "C:/Data/parquet/mtcars") %>%
  dplyr::collect())

Result of using colnames with collect

[1] "mpg"  "disp" "hp"   "drat" "wt"   "qsec" "vs"   "am"   "gear" "carb" "cyl"

I feel there is a more efficient way to get the parquet column names without calling collect. The end goal is a vector like the one above. Open to options and ideas.


Solution

  • According to the documentation, the Dataset object exposes a schema from which you can get the column names.

    I think it should be something like this:

    arrow::open_dataset(sources = "C:/Data/parquet/mtcars")$schema$names
    

    This only loads the metadata of the dataset, so it should be much faster than loading all the data.
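
For anyone who wants to try this without the original files, here is a minimal, self-contained sketch of the same idea. The temporary directory is a hypothetical stand-in for "C:/Data/parquet/mtcars", and the `names(ds)` shortcut is an assumption that the installed arrow version provides a `names()` method for Dataset objects; `$schema$names` is the form used in the answer above.

```r
library(arrow)

# Hypothetical temporary location standing in for "C:/Data/parquet/mtcars"
path <- file.path(tempdir(), "mtcars_parquet")

# Write a small partitioned dataset, as in the question
write_dataset(mtcars, path, format = "parquet", partitioning = "cyl")

# Opening the dataset does not pull any rows into memory
ds <- open_dataset(path)

# Column names taken from the schema, with no collect() involved
col_names <- ds$schema$names

# Assumed equivalent shortcut (requires a names() method for Dataset)
col_names_alt <- names(ds)

print(col_names)
```

The result should be a plain character vector, so it can be used directly with `%in%`, `setdiff()`, and similar functions.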