I'm new to beam so the whole triggering stuff really confuse me.
I have files that are uploaded regularly to gcs to a path that looks something like this: node-<num>/<table_name>/<timestamp>/files_parts
and I need to write something that would trigger when all 8 parts of a file exist.
Their names are something like that: file_1_part_1, file_1_part_2, file_2_part_1, file_2_part_2
(there could be multiple files parts in the same dir but if its a problem I could ask for it to change).
Is there any way to create this trigger? and if not what do you suggest I could do instead?
Thanks!
If you are using the Java SDK, you can use a transform Watch to achieve this. I don't see a counterpart in the Python SDK though.
I think it's better to write a program polling the files in the GCS directory. When 8 parts of a file is available, publish a message containing the file name to Pub/Sub or similar product.
Then in your Beam pipeline, use the Pub/Sub topic as the streaming source to do your ETL.