Search code examples
pythongoogle-cloud-storageapache-beamapache-beam-io

apache beam trigger when all necessary files in gcs bucket is uploaded


I'm new to beam so the whole triggering stuff really confuse me. I have files that are uploaded regularly to gcs to a path that looks something like this: node-<num>/<table_name>/<timestamp>/files_parts and I need to write something that would trigger when all 8 parts of a file exist.

Their names are something like that: file_1_part_1, file_1_part_2, file_2_part_1, file_2_part_2 (there could be multiple files parts in the same dir but if its a problem I could ask for it to change).

Is there any way to create this trigger? and if not what do you suggest I could do instead?

Thanks!


Solution

  • If you are using the Java SDK, you can use a transform Watch to achieve this. I don't see a counterpart in the Python SDK though.

    I think it's better to write a program polling the files in the GCS directory. When 8 parts of a file is available, publish a message containing the file name to Pub/Sub or similar product.

    Then in your Beam pipeline, use the Pub/Sub topic as the streaming source to do your ETL.