palantir-foundry, foundry-code-repositories

What exactly triggers a pipeline job to run in Code Repositories?


I want to understand when the pipeline job runs so I can better understand the pipeline build process. Does it check for code changes on the master branch of the Code Repository?


Solution

  • Building a job on the pipeline builds the artifact that was delivered to the instance, not what has been merged onto master.

    It should be the same, but there is a checking process after the merge onto master and before the delivery of the artifact, like you would have with a regular Git/Jenkins/Artifactory setup.

    So there is a delay.

    Moreover, if these checks don't pass, your change, even though it has been merged onto master, will never appear on the pipeline.


    To add a bit more precision to what @Kevin Zhang wrote: there is also the possibility to trigger a job using an API call, even though it's not the most common approach.
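
    For illustration, a call like the one below can kick off a build over HTTP; the endpoint path, payload shape and token handling here are assumptions to verify against your own Foundry instance's API documentation.

    ```python
    # Hypothetical sketch of triggering a build over HTTP -- verify the actual
    # endpoint and payload against your Foundry instance's API docs.
    import requests

    FOUNDRY_URL = "https://your-stack.palantirfoundry.com"  # made-up hostname
    TOKEN = "<api-token>"                                    # e.g. a service-user token

    def trigger_build(dataset_rid: str) -> dict:
        """Ask the orchestration service to build one dataset (illustrative only)."""
        response = requests.post(
            f"{FOUNDRY_URL}/api/v2/orchestration/builds/create",  # assumed endpoint
            headers={"Authorization": f"Bearer {TOKEN}"},
            json={"target": {"type": "manual", "targetRids": [dataset_rid]}},
        )
        response.raise_for_status()
        return response.json()
    ```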

    Also, you can combine the different events to express things like the following (see the sketch after this list):

    • Before work hours
      • build only if the schedule of the morning update has succeeded
    • During work hours
      • build every hour
        • if an input has new data
        • and
          • if a schedule has run successfully
          • or another dataset has been updated
    • After hours
      • build whenever an input has new data
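
    These combinations are configured in the scheduler rather than in code, but as a toy model of the boolean logic being composed (the working hours and flag names below are made up), it boils down to something like:

    ```python
    # Toy model of how trigger conditions combine -- not a Foundry API, just the
    # AND/OR logic from the list above. Working hours of 9-18 are an assumption.
    from datetime import datetime

    def should_build(now: datetime,
                     morning_update_succeeded: bool,
                     input_has_new_data: bool,
                     schedule_ran_successfully: bool,
                     other_dataset_updated: bool) -> bool:
        if now.hour < 9:
            # Before work hours: only if the morning update schedule succeeded.
            return morning_update_succeeded
        if now.hour < 18:
            # During work hours (checked every hour): new data AND
            # (a schedule ran successfully OR another dataset was updated).
            return input_has_new_data and (schedule_ran_successfully or other_dataset_updated)
        # After hours: whenever an input has new data.
        return input_has_new_data
    ```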

    It can also help you create loops. For example, if a huge amount of data arrives in input B and it impacts your sync toward the Ontology, or a time series, you could create a job that takes a limited number of rows from input B, logs their IDs in a table so they are not picked up again, and processes those rows; when output C is updated you rerun the job, and when there are no rows left you update output D.
    You can also add a schedule on the job that produces input B from input A, stating to rerun it only when output D is updated.
    This would enable you to process a number of files from a source, process the data from those files chunk by chunk, and then take another batch of files and iterate. A sketch of the chunking step is shown below.
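
    Here is a minimal sketch of that chunking step, assuming the Python transforms API; the dataset paths, the `id` column and the chunk size are made up, and appending to the processed-ids log is left to a companion job.

    ```python
    # Take one bounded chunk of not-yet-processed rows from input B.
    from transforms.api import transform, Input, Output

    CHUNK_SIZE = 10000  # assumed batch size

    @transform(
        output_c=Output("/project/output_C"),           # hypothetical path
        input_b=Input("/project/input_B"),              # hypothetical path
        processed_ids=Input("/project/processed_ids"),  # log of ids already handled
    )
    def take_next_chunk(output_c, input_b, processed_ids):
        b = input_b.dataframe()
        done = processed_ids.dataframe().select("id")

        # Keep rows whose id is not in the log, bounded to one chunk per run.
        chunk = b.join(done, on="id", how="left_anti").limit(CHUNK_SIZE)

        # A schedule watching output C reruns the downstream job; when the chunk
        # comes back empty, a final job can update output D instead.
        output_c.write_dataframe(chunk)
    ```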

    By naming your schedules functionally you get a more controlled build of your pipeline and a finer grain of data governance, and you can also add audit or log tables based on these schedules, which makes debugging and auditing much easier: you would have a trace of when and where a specific source update has reached, as in the illustrative audit step below.
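
    For example, a small audit step could record which functionally named schedule produced a run and when. The paths, columns and schedule label are assumptions, and in practice you would likely make this incremental so rows accumulate across runs.

    ```python
    # Illustrative audit-log step: one row per run of the schedule.
    from transforms.api import transform, Input, Output
    from pyspark.sql import functions as F

    @transform(
        audit_log=Output("/project/audit/source_update_log"),  # hypothetical path
        source=Input("/project/input_B"),                       # hypothetical path
    )
    def log_run(audit_log, source, ctx):
        row = ctx.spark_session.createDataFrame(
            [("morning_update_of_input_B", source.dataframe().count())],
            ["schedule_name", "row_count"],
        ).withColumn("run_at", F.current_timestamp())
        # Written as a snapshot here for brevity; an incremental transform would
        # let these rows accumulate into a proper audit trail.
        audit_log.write_dataframe(row)
    ```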

    Of course, you only need such precision if your pipeline is complex: many different sources, updated at different times and updating multiple parts of your pipeline.
    For instance, if you're unifying client data that was previously separated in many silos, or if the client is a multinational group of many different local or global entities, like big car manufacturers.