Tags: python, google-cloud-dataflow, apache-beam, parquet

How to tell which file a record came from when reading multiple Parquet files with Google Cloud Dataflow


I have a requirement to be able to trace lineage back to an individual Parquet file, and to be able to perform bulk loads, for example replaying a few years of Parquet files if a defect were discovered in the dataflow.

After many attempts, the following works for the bulk loads, where options.input is a RuntimeValueProvider and SplitFn just yields the pieces of str.split() (a sketch of it follows the pipeline):

import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

# options.input is a RuntimeValueProvider holding a comma-separated list of file URIs
with beam.Pipeline(options=PipelineOptions(), argv=args) as p:
    mainPipe = p \
        | 'CSV of URIs' >> beam.Create([options.input]) \
        | 'Split URIs into records' >> beam.ParDo(SplitFn(',')) \
        | "Read Parquet" >> beam.io.parquetio.ReadAllFromParquet(columns=[k for k in fields.keys()])

Unfortunately, beam.io.parquetio.ReadAllFromParquet won't say which file each record came from, and neither will ReadFromParquet, parquetio's other read transform.

Short of leaving Google Cloud Dataflow or teaching the team Java, what can I do to load many Parquet files at once into BigQuery and know which file each record came from?


Solution

  • Given the current API, I don't see a pre-made solution for this, although you can solve the problem in one of two ways:

    • Extend/modify ReadAllFromParquet to append the file name to each output record (see the sketch after this list).
    • Use BigQuery's own tools to import the Parquet files. I'm not sure they cover exactly the same scenario, though.
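
One way to implement the first option without modifying parquetio itself is to read each file inside a DoFn and tag every record with the URI it came from. This is only a sketch of the idea, not code from the library or from this answer: the class name ReadParquetWithFilename and the source_file field are invented here, pyarrow is assumed to be available on the workers (parquetio already depends on it), and you give up ReadAllFromParquet's splitting of large files across workers in exchange for knowing the file name.

import apache_beam as beam
from apache_beam.io.filesystems import FileSystems
import pyarrow.parquet as pq


class ReadParquetWithFilename(beam.DoFn):
    """Reads one whole Parquet file per input URI and tags each record with that URI."""

    def __init__(self, columns=None):
        self.columns = columns

    def process(self, file_uri):
        # FileSystems resolves gs:// paths on Dataflow as well as local paths.
        with FileSystems.open(file_uri) as f:
            table = pq.read_table(f, columns=self.columns)
        data = table.to_pydict()  # column name -> list of values
        for i in range(table.num_rows):
            record = {name: values[i] for name, values in data.items()}
            record['source_file'] = file_uri  # lineage back to the source file
            yield record

In the pipeline from the question, the final step would then become:

        | 'Read Parquet with lineage' >> beam.ParDo(
            ReadParquetWithFilename(columns=[k for k in fields.keys()]))

For the second option, BigQuery can load Parquet from Cloud Storage directly, for example through the Python client (the table and bucket names below are placeholders). Running one load job per file is one way to keep track of which file was loaded, although it does not stamp the file name onto individual rows:

from google.cloud import bigquery

client = bigquery.Client()
job_config = bigquery.LoadJobConfig(source_format=bigquery.SourceFormat.PARQUET)

# One load job per file keeps the file-to-load mapping explicit.
load_job = client.load_table_from_uri(
    "gs://my-bucket/events/2020/01/part-000.parquet",
    "my_project.my_dataset.events_staging",
    job_config=job_config,
)
load_job.result()  # block until the load job finishes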