Tags: palantir-foundry, foundry-code-repositories

How to force an incremental Foundry Transforms job to build non-incrementally without bumping the semantic version?


How can I force a particular dataset to build non-incrementally without changing the semantic version in the transforms repo?

Details about our specific use case:

We have about 50 datasets defined by a single incremental Python transform, registered manually in a for-loop. The input to this transform can be anywhere from hundreds to tens of thousands of small gzip files, so when the larger datasets run, everything gets repartitioned into only a handful of well-sized parquet files, which is perfect for our downstream jobs. However, after this job has been running incrementally for months (with files arriving every hour), the output also accumulates a large number of small parquet files. We'd like to be able to force a snapshot build of a single dataset without having to bump the semantic version of the transform, which would trigger snapshot builds for all 50 datasets. Is this possible?
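For context, the setup looks roughly like the sketch below (the paths, dataset list, and transform body are hypothetical placeholders). Because every generated transform shares the same `@incremental(semantic_version=...)` declaration, bumping that version would force a snapshot of all 50 outputs at once.

```python
from transforms.api import Input, Output, Pipeline, incremental, transform

# (source_path, output_path) pairs for the ~50 datasets -- hypothetical paths
DATASETS = [
    ("/Raw/feed_a", "/Clean/feed_a"),
    ("/Raw/feed_b", "/Clean/feed_b"),
    # ...
]


def make_transform(source_path, output_path):
    # Bumping semantic_version here would force a snapshot of *every* generated output.
    @incremental(semantic_version=1)
    @transform(
        out=Output(output_path),
        source=Input(source_path),
    )
    def compute(ctx, source, out):
        out.write_dataframe(source.dataframe())

    return compute


# Manual registration: attach all generated transforms to the repository's pipeline.
my_pipeline = Pipeline()
my_pipeline.add_transforms(*[make_transform(src, dst) for src, dst in DATASETS])
```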

I understand a potential workaround could be to define a "max output files" threshold in the transform itself, read the current number of files in the existing output, and force a snapshot whenever that count exceeds the maximum (sketched below). However, since this pipeline is time-sensitive (it needs to run in under an hour), this would introduce unpredictability, because any hourly run could turn into a much longer snapshot build. We'd like to schedule these full snapshot builds about once a month, on a weekend.
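For reference, that workaround could look roughly like the following sketch. The threshold, paths, and target file count are placeholders, and the `out.filesystem(mode="previous")` call and `set_mode("replace")` usage should be checked against the incremental Transforms API available on your stack.

```python
from transforms.api import Input, Output, incremental, transform

MAX_OUTPUT_FILES = 200  # hypothetical threshold


@incremental()
@transform(
    out=Output("/Clean/feed_a"),   # hypothetical paths
    source=Input("/Raw/feed_a"),
)
def compute(ctx, source, out):
    if ctx.is_incremental:
        # Count the parquet files already committed to the output by previous builds.
        existing = list(out.filesystem(mode="previous").ls(glob="**/*.parquet"))
        if len(existing) > MAX_OUTPUT_FILES:
            # Output is too fragmented: rewrite everything as one compacted snapshot.
            out.set_mode("replace")
            full_input = source.dataframe("current")  # all input rows, not just newly added ones
            out.write_dataframe(full_input.coalesce(8))
            return

    # Normal incremental path: only unprocessed input rows are visible by default.
    out.write_dataframe(source.dataframe())
```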


Solution

  • Commit an empty APPEND transaction on the output dataset. The next time the incremental transform builds that dataset, it will see that its output was modified outside the transform, so the incremental requirements are not met and it falls back to a snapshot build; the other 49 datasets keep building incrementally, and the semantic version never changes. (Make sure the transform does not set require_incremental=True, which would make that build fail instead of snapshotting.)
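One way to commit such an empty transaction programmatically is sketched below, assuming access to the Foundry platform Datasets REST API and a token with write access to the dataset. The exact endpoint paths, query parameters, and response fields are assumptions here and should be verified against your enrollment's API documentation (some endpoints may also require a preview flag depending on the Foundry version).

```python
# Minimal sketch: open and immediately commit an empty APPEND transaction.
# Hostname, dataset RID, branch name, and token are placeholders.
import requests

HOSTNAME = "https://your-stack.palantirfoundry.com"
DATASET_RID = "ri.foundry.main.dataset.<uuid>"
HEADERS = {"Authorization": "Bearer <your-foundry-token>"}

# 1. Open an APPEND transaction on the output dataset's branch.
resp = requests.post(
    f"{HOSTNAME}/api/v2/datasets/{DATASET_RID}/transactions",
    params={"branchName": "master"},
    headers=HEADERS,
    json={"transactionType": "APPEND"},
)
resp.raise_for_status()
transaction_rid = resp.json()["rid"]

# 2. Commit it without adding any files -- the data is unchanged, but the
#    incremental transform now sees a foreign transaction on its output.
resp = requests.post(
    f"{HOSTNAME}/api/v2/datasets/{DATASET_RID}/transactions/{transaction_rid}/commit",
    headers=HEADERS,
)
resp.raise_for_status()
```

After the empty transaction commits, the next build of that one dataset should run as a snapshot and compact the output back into a small number of parquet files; subsequent builds should then resume incrementally. This step can be scheduled (for example, once a month on a weekend) without touching the repository or its semantic version.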