Tags: python, google-cloud-dataflow, apache-beam, parquet

How to tell which file a record came from when reading multiple Parquet files with Google Cloud Dataflow


I have a requirement to be able to trace lineage back to an individual Parquet file, and to be able to perform bulk loads, for example replaying a few years of Parquet files if a defect were discovered in the dataflow.

After many attempts, the following works for the bulk loads, where options.input is a RuntimeValueProvider and SplitFn just yields the pieces of str.split() (a sketch of it follows the pipeline):

import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

# options.input is a RuntimeValueProvider holding a comma-separated list of file URIs
with beam.Pipeline(options=PipelineOptions(), argv=args) as p:
    mainPipe = p \
        | 'CSV of URIs' >> beam.Create([options.input]) \
        | 'Split URIs into records' >> beam.ParDo(SplitFn(',')) \
        | "Read Parquet" >> beam.io.parquetio.ReadAllFromParquet(columns=[k for k in fields.keys()])

Unfortunately, beam.io.parquetio.ReadAllFromParquet won't say which file each record came from, and neither will ReadFromParquet, parquetio's other read transform.

Short of leaving Google Cloud Dataflow or teaching the team Java, what can I do to load many Parquet files at once into BigQuery and know which file each record came from?


Solution

  • Given the current API, I don't see a pre-made solution for this, although you can solve the problem in one of two ways:

    • Extend/modify ReadAllFromParquet to append the file name to each output record (see the sketch after this list).
    • Use BigQuery's own tools to import the Parquet files. I'm not sure they cover exactly the same scenario, though.
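
One way to implement the first option without modifying parquetio itself is to read each file inside a DoFn and tag every record with the URI it came from. This is only a sketch of the idea, not code from the library or from this answer: the class name ReadParquetWithFilename and the source_file field are invented here, pyarrow is assumed to be available on the workers (parquetio already depends on it), and you give up ReadAllFromParquet's splitting of large files across workers in exchange for knowing the file name.

import apache_beam as beam
from apache_beam.io.filesystems import FileSystems
import pyarrow.parquet as pq


class ReadParquetWithFilename(beam.DoFn):
    """Reads one whole Parquet file per input URI and tags each record with that URI."""

    def __init__(self, columns=None):
        self.columns = columns

    def process(self, file_uri):
        # FileSystems resolves gs:// paths on Dataflow as well as local paths.
        with FileSystems.open(file_uri) as f:
            table = pq.read_table(f, columns=self.columns)
        data = table.to_pydict()  # column name -> list of values
        for i in range(table.num_rows):
            record = {name: values[i] for name, values in data.items()}
            record['source_file'] = file_uri  # lineage back to the source file
            yield record

In the pipeline from the question, the final step would then become:

        | 'Read Parquet with lineage' >> beam.ParDo(
            ReadParquetWithFilename(columns=[k for k in fields.keys()]))

For the second option, BigQuery can load Parquet from Cloud Storage directly, for example through the Python client (the table and bucket names below are placeholders). Running one load job per file is one way to keep track of which file was loaded, although it does not stamp the file name onto individual rows:

from google.cloud import bigquery

client = bigquery.Client()
job_config = bigquery.LoadJobConfig(source_format=bigquery.SourceFormat.PARQUET)

# One load job per file keeps the file-to-load mapping explicit.
load_job = client.load_table_from_uri(
    "gs://my-bucket/events/2020/01/part-000.parquet",
    "my_project.my_dataset.events_staging",
    job_config=job_config,
)
load_job.result()  # block until the load job finishes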