I need to be able to trace lineage back to an individual parquet file, and to perform bulk loads, e.g. to replay a few years of parquet files if a defect were discovered in the dataflow.
After many attempts, the following works for the bulk loads, where options.input is a RuntimeValueProvider and SplitFn simply yields the result of str.split():
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

with beam.Pipeline(options=PipelineOptions(), argv=args) as p:
    mainPipe = p \
        | 'CSV of URIs' >> beam.Create([options.input]) \
        | 'Split URIs into records' >> beam.ParDo(SplitFn(',')) \
        | 'Read Parquet' >> beam.io.parquetio.ReadAllFromParquet(columns=[k for k in fields.keys()])
Unfortunately beam.io.parquetio.ReadAllFromParquet won't say which file each record came from, and neither will ReadFromParquet, the module's only other read transform.
Short of leaving Google Cloud Dataflow or teaching the team Java, what can I do to load many parquet files at once into BigQuery and know which file each record came from?
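Conceptually, what I want is every row tagged with the URI it was read from. A workaround I'm considering (the DoFn name and the pyarrow/FileSystems combination below are my own sketch, not anything parquetio provides) would be to read each file manually and attach the path, but I'd rather not reimplement the source:

import apache_beam as beam
from apache_beam.io.filesystems import FileSystems
import pyarrow.parquet as pq

class ReadParquetWithFilename(beam.DoFn):
    """Reads one parquet file and emits each row together with its source URI."""
    def __init__(self, columns):
        self.columns = columns

    def process(self, file_uri):
        # FileSystems resolves gs:// paths the same way the built-in sources do.
        with FileSystems.open(file_uri) as f:
            table = pq.read_table(f, columns=self.columns)
        data = table.to_pydict()  # column name -> list of values
        for i in range(table.num_rows):
            record = {name: values[i] for name, values in data.items()}
            record['source_file'] = file_uri
            yield record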
Given the current API, I don't see a pre-made solution for this. However, you can solve the problem by either: