I am looking to get only the column names from a partitioned parquet dataset using the arrow package in R. My goal is a character vector containing just the column names. I can do this with collect, but on larger datasets with multiple partitions and multiple files it takes longer than expected. Here is an example of what I have and what I am hoping to achieve.
Create a parquet dataset with a partition (some datasets may have multiple partitions)
arrow::write_dataset(mtcars, "C:/Data/parquet/mtcars", format = "parquet", partitioning = c("cyl"))
Current way to get parquet column names
library(dplyr)  # provides the %>% pipe used below
colnames(arrow::open_dataset(sources = "C:/Data/parquet/mtcars") %>%
  dplyr::collect())
Result of using colnames with collect
[1] "mpg" "disp" "hp" "drat" "wt" "qsec" "vs" "am" "gear" "carb" "cyl"
I feel there should be a more efficient way to get the parquet column names without doing a collect. The end goal is a vector like the one above. I am open to options and ideas.
According to the documentation, the Dataset object has a schema field from which you can get the column names.
I think it should be something like this:
arrow::open_dataset(sources = "C:/Data/parquet/mtcars")$schema$names
This reads only the dataset's schema metadata rather than the row data, so it should be much faster than collecting everything into memory.
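For a self-contained check, here is a minimal sketch of the same idea. It assumes the arrow package is installed and writes to a temporary directory (a hypothetical path used only for this example) instead of C:/Data/parquet/mtcars. The variable names ds and col_names are illustrative.

```r
library(arrow)

# Hypothetical temporary location, used only for this sketch
path <- file.path(tempdir(), "mtcars_parquet")

# Write a dataset partitioned by cyl, as in the question
write_dataset(mtcars, path, format = "parquet", partitioning = "cyl")

# Opening the dataset does not pull the row data into R
ds <- open_dataset(path)

# Character vector of column names, including the partition column
col_names <- ds$schema$names
print(col_names)

# Sanity check: same set of columns as the original data frame
stopifnot(is.character(col_names), setequal(col_names, colnames(mtcars)))
```

The check uses setequal rather than identical because the partition column (cyl here) may not sit in its original position, as the output in the question shows.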