I would appreciate your help please.
I have some .parquet files in an Azure Blob Storage container.
I am able to read in just one file at a time using R.
The code is:
raw_data <- download_from_url("https://myacct.blob.core.windows.net/retail.parquet",
key="wRGSJ", dest=NULL)
parq_df <- read_parquet(raw_data)
Instead, I want to write a loop that goes through all of the files in the container, reads each one in, and appends them into one parquet file.
I am stuck at the moment; I would appreciate any help I can get. Thank you.
To read and combine multiple parquet files from an Azure Blob Storage container:
1. List all the files in the blob container: you'll first need the names of all the parquet files in the container.
2. Loop through the list, downloading and reading each file: inside the loop, each file's data is appended to a list of dataframes.
3. Combine the dataframes and write the result to a new parquet file.
Sample Code:
library(arrow)
# Function to list all parquet files in the blob container.
# Note: it must NOT be named list_blobs, because that would shadow
# AzureStor::list_blobs and make the function call itself recursively.
list_parquet_blobs <- function(account_host, key, container_name) {
  # Using the AzureStor package
  library(AzureStor)
  blob_endpoint <- paste0("https://", account_host, "/")
  blob_account <- storage_endpoint(blob_endpoint, key = key)
  blob_container <- storage_container(blob_account, container_name)
  # AzureStor::list_blobs() returns a data frame; keep only the parquet names
  blob_names <- AzureStor::list_blobs(blob_container)$name
  blob_names[grepl("\\.parquet$", blob_names)]
}
# List all the parquet files
all_files <- list_parquet_blobs("myacct.blob.core.windows.net", "wRGSJ", "your-container-name")
all_data <- list()
# Loop through each file, download, and read.
# The listing returns blob names, not URLs, so build the full URL for each one.
container_base <- "https://myacct.blob.core.windows.net/your-container-name/"
for (blob_name in all_files) {
  # dest = NULL makes download_from_url() return the contents as a raw vector
  raw_data <- download_from_url(paste0(container_base, blob_name), key = "wRGSJ", dest = NULL)
  parq_df <- read_parquet(raw_data)
  # Append the data
  all_data[[length(all_data) + 1]] <- parq_df
}
# Combine all data (this assumes every file has the same columns)
combined_data <- do.call(rbind, all_data)
# Write to a single parquet file
write_parquet(combined_data, "combined.parquet")
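If the files don't all share exactly the same columns, `do.call(rbind, ...)` will error. `dplyr::bind_rows()` is a more forgiving way to stack the dataframes: it fills missing columns with NA. A minimal sketch, using made-up dataframes standing in for the per-file results:

```r
library(dplyr)

# Two stand-in dataframes, as if read from two parquet files;
# note the second one has an extra column
df1 <- data.frame(store = "A", sales = 100)
df2 <- data.frame(store = "B", sales = 200, region = "West")

# rbind() would fail here because the columns differ;
# bind_rows() pads the missing column with NA instead
combined <- bind_rows(df1, df2)
```

You could drop `bind_rows(all_data)` in place of `do.call(rbind, all_data)` in the loop code above.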
The helper function above uses the AzureStor package to list all the files in the blob container.
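As an alternative to row-binding everything in memory, if the blobs are first downloaded into a local folder, arrow can treat that whole folder as a single dataset and combine the files for you. A sketch under that assumption (the temp-dir files here stand in for the downloaded blobs):

```r
library(arrow)
library(dplyr)

# Stand-in for a folder of downloaded blobs: write two small parquet files
dir <- file.path(tempdir(), "parquet_demo")
dir.create(dir, showWarnings = FALSE)
write_parquet(data.frame(x = 1:3), file.path(dir, "part1.parquet"))
write_parquet(data.frame(x = 4:6), file.path(dir, "part2.parquet"))

# open_dataset() scans every parquet file in the directory lazily,
# so the full data never has to be held in memory at once
ds <- open_dataset(dir)

# collect() materialises the combined rows; write them out as one file
combined <- ds %>% collect()
write_parquet(combined, file.path(tempdir(), "combined.parquet"))
```

This scales better than rbind when the container holds many large files, since arrow can also filter and select columns before `collect()`.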