How do you merge raw data feeds and connections into one in Palantir Foundry?


We have multiple raw datasets of loose files that we want to merge: N raw datasets, each with its own Data Connection and schedule.

These datasets are just loose files - there are no schemas or anything involved.

We want to merge these files into a single raw dataset to simplify our raw ingestion (especially to cut down on Data Connections, which are very cumbersome to change).

One complicating factor is that the history of the data is not available via the Data Source, so we'd need to merge the existing raw datasets to keep the history.

The most straightforward solution seemed to be copying the raw data via a transform into a new dataset, and then creating a new Data Connection to update it.
But Foundry doesn't seem to support this.


Solution

  • A dataset is defined in exactly one place (a Data Connection, a transform, etc.). You can't configure the same dataset to be "generated via a transform" AND "generated by a Data Connection".

    However, nothing prevents you from having multiple datasets and then unioning them in a second step. You can ingest your historical data into one dataset A, either by uploading it manually or by connecting to the source system that contains this data. You can ingest your ongoing data through a single Data Connection sync into a dataset B. Then you can transform (via Pipeline Builder or a Code Repository) those two datasets into a third one, dataset C, which is the union of A and B.

    If you are worried about the time or scale of the union, and if the datasets' schemas match exactly, you can use a View (right click in a folder > Create View) to union those datasets virtually.

    EDIT: Per the clarification, if these are raw files, then the only solution I see is to union those datasets downstream, during the first step of the processing you apply to them. I assume you will parse them (in a Code Repository or Pipeline Builder), so the first step can simply read from both. There shouldn't be any downside to doing this (incremental still works, etc.).
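As a rough illustration of such a first step for schemaless files, here is a minimal sketch that copies every file from several raw inputs into one output. It assumes the file-system objects behave like the ones returned by `.filesystem()` on a `transforms.api` input or output, i.e. `ls()` yields entries with a `path` attribute and `open(path, mode)` returns a file-like context manager. The helper name `merge_raw_files`, the source prefixes, and the dataset paths in the commented wiring are all hypothetical.

```python
import shutil


def merge_raw_files(sources, output_fs):
    """Copy every file from each (prefix, filesystem) pair into output_fs.

    Each file is written under "<prefix>/<original path>" so that files
    with the same path in different source datasets do not overwrite
    each other. Returns the list of paths written.
    """
    copied = []
    for prefix, fs in sources:
        for status in fs.ls():
            target = f"{prefix}/{status.path}"
            # Stream bytes as-is; nothing is parsed, so no schema is needed.
            with fs.open(status.path, "rb") as src, \
                    output_fs.open(target, "wb") as dst:
                shutil.copyfileobj(src, dst)
            copied.append(target)
    return copied


# Hypothetical wiring in a Code Repository (dataset paths are placeholders):
#
# from transforms.api import transform, Input, Output
#
# @transform(
#     merged=Output("/Project/raw/merged"),
#     feed_a=Input("/Project/raw/feed_a"),
#     feed_b=Input("/Project/raw/feed_b"),
# )
# def compute(merged, feed_a, feed_b):
#     merge_raw_files(
#         [("feed_a", feed_a.filesystem()), ("feed_b", feed_b.filesystem())],
#         merged.filesystem(),
#     )
```

The merged output here is an ordinary transform-built dataset, not a Data Connection target, so each source dataset keeps its own sync. Prefixing by source is one possible choice; dropping the prefix would flatten the layout at the risk of path collisions.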