
Looping through all files stored in an Azure Blob Storage container in R


I would appreciate your help please.

I have some .parquet files in an Azure Blob Storage container.

I am able to read in just one file at a time using R.

The code is:

raw_data <- download_from_url("https://myacct.blob.core.windows.net/retail.parquet",
                  key="wRGSJ", dest=NULL)
parq_df <- read_parquet(raw_data)

Instead, I want to write a loop that goes through the container, reads in each file, and appends them all into a single parquet file.

I am stuck at the moment; I would appreciate any help I can get. Thank you.


Solution

  • To read and combine multiple parquet files from an Azure Blob Storage container, follow these steps:

    1. List all the files in the blob container: You'll first need a list of all parquet files in the container.

    2. Loop through the list, downloading and reading each file: for each file, append its data to a running collection that is later combined into one data frame.

    3. Write the combined dataframe to a new parquet file.

    Sample Code:

    library(arrow)
    library(AzureStor)
    
    # Return the names of all .parquet blobs in a container.
    # Named list_parquet_blobs so it does not mask AzureStor::list_blobs.
    list_parquet_blobs <- function(container) {
      # info = "name" returns a character vector of blob names
      blob_names <- list_blobs(container, info = "name")
      blob_names[grepl("\\.parquet$", blob_names)]
    }
    
    # Connect to the container; replace the account, key, and container name with your own
    endpoint <- storage_endpoint("https://myacct.blob.core.windows.net", key = "wRGSJ")
    container <- storage_container(endpoint, "your-container-name")
    
    # List all the parquet files
    all_files <- list_parquet_blobs(container)
    all_data <- list()
    
    # Loop through each file, download it into memory, and read it
    for (blob_name in all_files) {
      # dest = NULL returns the blob contents as a raw vector
      raw_data <- storage_download(container, src = blob_name, dest = NULL)
      parq_df <- read_parquet(raw_data)
      
      # Append the data
      all_data[[length(all_data) + 1]] <- parq_df
    }
    
    # Combine all data (rbind assumes every file has the same columns)
    combined_data <- do.call(rbind, all_data)
    
    # Write to a single parquet file
    write_parquet(combined_data, "combined.parquet")
    
    • Listing the files relies on list_blobs from the AzureStor package, which returns the blobs stored in the container.
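
The read-and-combine step can also be written without an explicit loop. The following is only a minimal sketch, not part of the original answer: combine_frames and read_one are hypothetical names, and fake_store is stand-in data so the example runs without Azure access or the arrow package.

```r
# Hypothetical helper: read every item with `read_one`, then stack the results.
# `read_one` is any function that takes one name and returns a data frame.
combine_frames <- function(items, read_one) {
  frames <- lapply(items, read_one)
  # rbind assumes all data frames share the same columns
  do.call(rbind, frames)
}

# Stand-in data playing the role of the blobs in a container (assumption)
fake_store <- list(
  "a.parquet" = data.frame(id = 1:2, sales = c(10, 20)),
  "b.parquet" = data.frame(id = 3:4, sales = c(30, 40))
)

# Combine the stand-in "files" into one data frame
combined <- combine_frames(names(fake_store), function(name) fake_store[[name]])
print(combined)
```

Against real storage, read_one would presumably be a function that downloads one blob into memory (for example with storage_download and dest = NULL on a container object from storage_container) and passes the result to read_parquet. If the files do not all have the same columns, rbind will fail, and a different combining function would be needed.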