palantir-foundry

Optimal Data Sync Strategy for Palantir Foundry: Handling Incremental Updates Based on Timestamps


I have one table in the source system and I want to sync it into Palantir Foundry, keeping the following points in mind:

  • Assume it has "ID" as the primary key, plus CREATED_DATE and MODIFIED_DATE columns.
  • I want the sync to load all new data that arrives in the table and also to update the data that is already in Foundry, identified via the MODIFIED_DATE column.

So which transaction type should I use, do I need to turn Incremental on, and which query should I write to achieve the above scenario?

I have tried using the Append transaction type with Incremental turned on, but with this configuration I am not able to update the old data that is already in Foundry using the MODIFIED_DATE column.

Also, I am looking for a better solution than Snapshot, which does not work well here.


Solution

  • Looking at the JDBC documentation (assuming you connect via JDBC), you will need to set the incremental column to MODIFIED_DATE, since, as the docs put it, "If your rows are mutable (...), you'll need a column that increases with every mutation of the data (e.g. an update_time column)."

    The "good" solution depends on what you want to do with the data downstream, namely, if you need to have the history of changes or want to keep "the latest only".

    As the documentation says: "When you ingest an updated version of an existing row, the Foundry dataset will still include previous versions of the row (remember, we're using the APPEND transaction type). If you want only the latest version of each row, you will need to use another tool in Foundry, such as Transforms, to clean the data."

    You can very well ingest with "Append": if a row has been "updated" in the source, it will simply be re-added as a new row. You can then deduplicate downstream, for instance:

    • with a View (from a folder: Actions > New > View), where you can configure which column to deduplicate on - in your case MODIFIED_DATE;
    • with a merge-and-update transform - the output transaction will always be a snapshot, but you will incrementally read your input and merge it with the "latest version" of your output (a minimal sketch follows after this list);
    • with an incremental transform containing your custom logic (e.g. if you want to put the "modified" rows in one dataset and the "new" rows in another, etc.).
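
    For the merge-and-update option, here is a minimal sketch of what such a transform could look like in Python. The dataset paths /source/raw_table (the APPEND-ingested dataset) and /clean/latest_rows are hypothetical, the ID / MODIFIED_DATE columns are taken from the question, and the exact decorator options may differ in your environment:

    from pyspark.sql import functions as F, Window
    from transforms.api import Input, Output, incremental, transform

    @incremental()
    @transform(
        source=Input("/source/raw_table"),   # APPEND-ingested dataset (hypothetical path)
        out=Output("/clean/latest_rows"),    # deduplicated "latest only" output (hypothetical path)
    )
    def merge_and_update(source, out):
        # With @incremental, source.dataframe() returns only the rows
        # appended since the last successful build.
        new_rows = source.dataframe()

        # Read the previous version of the output (empty on the very first run).
        previous = out.dataframe("previous", schema=new_rows.schema)

        # Union old and new, then keep the latest version of each primary key
        # according to MODIFIED_DATE.
        merged = previous.unionByName(new_rows)
        window = Window.partitionBy("ID").orderBy(F.col("MODIFIED_DATE").desc())
        latest = (
            merged
            .withColumn("_rn", F.row_number().over(window))
            .filter(F.col("_rn") == 1)
            .drop("_rn")
        )

        # Rewrite the whole output: the transaction is effectively a snapshot,
        # even though the input was read incrementally.
        out.set_mode("replace")
        out.write_dataframe(latest)

    On the first build this simply writes the deduplicated input; on subsequent builds it only reads the newly appended rows and merges them into the previous output.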

    It is worth noting that Object syncing supports incremental input datasets, which means that when new row versions arrive with a "conflicting" primary key, only the latest version ends up available as an object.

    e.g. you receive in Transaction 1:

    PK 1 | Object A | Value 1  
    PK 2 | Object B | Value 2 
    

    and then you later receive in Transaction 2:

    PK 1 | Object A | Value 10 
    

    this will result in your Ontology containing:

    PK 1 | Object A | Value 10  
    PK 2 | Object B | Value 2 
    

    Note: You need to deduplicate primary keys within each new transaction: if you provide a transaction containing both PK 1 | Object A | Value 30 and PK 1 | Object A | Value 20 at the same time, Foundry has no way of knowing which row version to keep.
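
    One way to do this in a transform, before the data reaches the object sync, is to keep only the most recent version of each primary key within the batch. This is only a sketch and assumes MODIFIED_DATE can be used to order the versions:

    from pyspark.sql import functions as F

    def keep_latest_per_key(df):
        # Keep, for each ID, the row with the greatest MODIFIED_DATE.
        # max() over a struct compares field by field, so MODIFIED_DATE
        # must be the first field of the struct.
        other_cols = [c for c in df.columns if c not in ("ID", "MODIFIED_DATE")]
        return (
            df.groupBy("ID")
              .agg(F.max(F.struct("MODIFIED_DATE", *other_cols)).alias("latest"))
              .select("ID", "latest.*")
        )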

    EDIT - Deletion of rows

    • Views support a "deleted" boolean field, which does exactly this: set it to False and the row is kept, set it to True and the row is deleted.
    • Merge-and-update can handle it as well - since you rewrite your output on every build, you can drop rows from it.
    • In the more general case, if you want to keep an incremental pipeline, you might just want to propagate a row with a deleted boolean column and handle it downstream (what that means depends on the exact computation - e.g. if there is an aggregation, you need to subtract the deleted row or recompute the aggregate). Populating the deleted column in the first place depends on the source system (a sketch of handling it in the merge step follows below).
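
    Building on the merge-and-update sketch above, deletions could be handled like this, assuming the source emits a hypothetical boolean IS_DELETED column alongside ID and MODIFIED_DATE:

    from pyspark.sql import functions as F, Window

    def merge_with_deletes(previous, new_rows):
        # Merge new row versions into the previous output, keep the latest
        # version of each ID, then drop rows whose latest version is a
        # delete marker (IS_DELETED = True).
        merged = previous.unionByName(new_rows)
        window = Window.partitionBy("ID").orderBy(F.col("MODIFIED_DATE").desc())
        latest = (
            merged
            .withColumn("_rn", F.row_number().over(window))
            .filter(F.col("_rn") == 1)
            .drop("_rn")
        )
        # Treat a missing IS_DELETED value as "not deleted".
        return latest.filter(~F.coalesce(F.col("IS_DELETED"), F.lit(False)))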