Is it possbile to store processed files into where it was stored initially, using Google-provided utility templates?

One of the Google Dataflow utility templates allows us to do compression for files in GCS (Bulk Compress Cloud Storage files).

While it is possible to have multiple inputs for the parameter that consist of different folders (e.g: inputFilePattern=gs://YOUR_BUCKET_NAME/uncompressed/**.csv,), is it actually possible to store the 'compressed'/processed files into the same folder where it was stored initially?

Solution

If you have a look at the documentation:

The extensions appended will be one of: .bzip2, .deflate, .gz.

Therefore, the new compressed files won't match the provided pattern (*.csv). And thus, you can store them in the same folder without conflict.

In addition, this process is a batch process. When you look deeper in the dataflow IO component, especially to read with a pattern into GCS, the file list (of file to compress) is read at the beginning of the job and thus don't evolve during the job.

Therefore, if you have new files that come in and which match the pattern during a job, they won't take into account by the current job. You will have to run another job to take these new files.

Eventually, a last thing: the existing uncompressed files aren't replaced by the compressed ones. That means you will have the file in double: compressed and uncompressed version. To save space (and money) I recommend you to delete one of the two version.