How can I force a particular dataset to build non-incrementally without changing the semantic version in the transforms repo?
Details about our specific use case:
We have about 50 datasets defined by a single incremental Python transform, registered manually in a for-loop. The input to this transform can be anywhere from hundreds to tens of thousands of small gzip files, so when the larger dataset runs, the transform repartitions all of these into only a handful of well-sized parquet files, which is perfect for our downstream jobs. However, after this job has been running incrementally for months (with files arriving every hour), the output also accumulates a large number of small parquet files. We'd like to be able to force a snapshot build of this single dataset without bumping the semantic version of the transform, which would trigger snapshot builds for all 50 datasets. Is this possible?
I understand a potential workaround could be defining a "max output files" threshold in the transform itself, reading the current number of files in the existing output, and forcing a snapshot whenever the current count exceeds the maximum. However, since this pipeline is time sensitive (it needs to run in under an hour), that would introduce unpredictability, because a snapshot build takes much longer than an incremental one. We'd like to schedule these full snapshot builds for about once a month, on a weekend.
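For reference, the workaround described above can be sketched as a small, self-contained decision policy. All names and thresholds here are hypothetical; in a real Foundry transform the file count would come from listing the output dataset's existing parquet files, and the snapshot itself would be forced by switching the output's write mode to a full replace. Gating on the weekday is one way to keep weekday runs fast while still allowing occasional compaction:

```python
from datetime import date

MAX_OUTPUT_FILES = 200      # hypothetical cap before compaction is worthwhile
SNAPSHOT_WEEKDAYS = {5, 6}  # Saturday and Sunday

def should_snapshot(output_file_count: int, today: date) -> bool:
    """Decide whether the next build should be a full snapshot.

    In a real transform, output_file_count would be obtained by
    listing the output dataset's files; it is a plain parameter
    here so the policy can be tested on its own.
    """
    too_fragmented = output_file_count > MAX_OUTPUT_FILES
    is_weekend = today.weekday() in SNAPSHOT_WEEKDAYS
    return too_fragmented and is_weekend

# A fragmented output triggers a snapshot only on a weekend run:
assert should_snapshot(450, date(2024, 6, 8)) is True    # Saturday
assert should_snapshot(450, date(2024, 6, 10)) is False  # Monday
assert should_snapshot(120, date(2024, 6, 8)) is False   # few files, no need
```

This keeps the expensive compaction off the hourly critical path, but as noted above it still makes the weekend runs unpredictable, which is why an externally triggered snapshot is preferable.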
Commit an empty APPEND transaction on the output dataset. An incremental build requires that the output has only been modified by previous builds of the same transform; once an external transaction touches the output, the next build falls back to a full snapshot. Because this is per-dataset and doesn't involve the semantic version, the other 49 datasets keep building incrementally, and you can trigger the empty transaction on whatever schedule suits you (e.g., monthly on a weekend).