Tags: python, python-3.x, google-bigquery, google-cloud-storage, parquet

Add partition columns of Parquet files from Google Cloud Storage to BigQuery


I have Parquet files stored in a Google Cloud Storage Bucket with paths such as: gs://some_storage/files.parquet/category=abc/type=xyz/partition.parquet

Each parquet file has the fields: {'date':'2023-03-01','value': 2.718}

I am loading these fields into BigQuery and need to include the partition columns, i.e. category and type, in the final table, so that each event has the fields: {'date':'2023-03-01','value': 2.718, 'category': 'abc', 'type': 'xyz'}

Currently I'm iterating over the directory gs://some_storage/files.parquet, extracting the category and type partitions from the paths with a regexp, appending the values to the Parquet data at read time and inserting it into BigQuery.

There must be a better way, since this form of partitioning is standard for Parquet files. Is there any method, either via pyarrow or Google Cloud services, that reads in the partition columns directly without iterating over paths and applying a regexp? Or, better, is there any way to end up with the data in a BigQuery table that includes the category and type columns?

Thank you in advance for any help.

My current method looks like this:

import re
import gcsfs
import pyarrow.parquet as pq

fs = gcsfs.GCSFileSystem(project='gcs-project')

# extract the leaf file paths by walking the category=*/type=* directories
paths = []
root_dir = 'gs://some_storage/files.parquet'
category_paths = fs.ls(root_dir)
for category_path in category_paths:
    type_paths = fs.ls(category_path)
    for type_path in type_paths:
        paths.extend(fs.ls(type_path))


# read each file and append the partition columns parsed from its path
for path in paths:
    category = re.search(r'category=(.*?)/', path).group(1)
    type_ = re.search(r'type=(.*?)/', path).group(1)

    df = pq.read_table(path, filesystem=fs).to_pandas()

    # append the category and type on the df
    df['category'] = category
    df['type'] = type_

# finally insert to bigquery
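
The final insert step currently looks roughly like this (a minimal sketch using the google-cloud-bigquery client; the destination dataset and table names are placeholders):

from google.cloud import bigquery

bq_client = bigquery.Client(project='gcs-project')

# load the DataFrame into a placeholder destination table,
# appending on each iteration of the loop above
job = bq_client.load_table_from_dataframe(
    df,
    'gcs-project.some_dataset.some_table',
    job_config=bigquery.LoadJobConfig(write_disposition='WRITE_APPEND'),
)
job.result()  # wait for the load job to complete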

Solution

  • There must be a better way since this form of partitioning is standard with parquet files.

    One possible option is to use an external table. Assume that you have Parquet files stored in GCS under paths like the ones below:

    [screenshot: Parquet files laid out under gs://some_storage/files.parquet/ in Hive-style category=*/type=* directories]

    You can create an external table over the Hive-partitioned data stored in GCS above.

    CREATE OR REPLACE EXTERNAL TABLE `your-project.your-dataset.stg_table` 
      WITH PARTITION COLUMNS (
        category STRING,
        type STRING
      )
    OPTIONS (
      uris=['gs://some_storage/files.parquet/*'],
      format='PARQUET',
      hive_partition_uri_prefix = 'gs://some_storage/files.parquet',
      require_hive_partition_filter = false
    );
    

    If you query the table,

    SELECT * FROM `your-project.your-dataset.stg_table`;
    

    you will get the result below

    [screenshot: query result showing the date and value fields together with the category and type partition columns]

    You can submit the above SQL scripts to BigQuery via the Python BigQuery SDK.
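
    A minimal sketch of that with the google-cloud-bigquery client (the project, dataset and table names are the placeholders from the DDL above):

    from google.cloud import bigquery

    client = bigquery.Client(project='your-project')

    # run the DDL above to create/replace the external table
    ddl = """
    CREATE OR REPLACE EXTERNAL TABLE `your-project.your-dataset.stg_table`
      WITH PARTITION COLUMNS (
        category STRING,
        type STRING
      )
    OPTIONS (
      uris=['gs://some_storage/files.parquet/*'],
      format='PARQUET',
      hive_partition_uri_prefix = 'gs://some_storage/files.parquet',
      require_hive_partition_filter = false
    );
    """
    client.query(ddl).result()  # wait for the DDL to finish

    # query the external table; the partition columns come back as ordinary columns
    rows = client.query(
        'SELECT * FROM `your-project.your-dataset.stg_table`'
    ).result()
    for row in rows:
        print(dict(row))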
