google-cloud-platform pipeline google-cloud-data-fusion

Union 2 datasets in Google Cloud Data Fusion


I receive regular data uploads from two sources that submit their data in the same structure (same columns, etc.). I am trying to create a pipeline in Data Fusion that combines this data into one table. Is there any way to do this? I found a similar question, but since it was asked and answered a decade ago, I imagine the platform has changed quite a bit, so I hope this is okay to ask.

I've tried using a join and setting every column in one dataset equal to the same column in the other, but to no avail. The output contains null rows equal in number to the rows of the dataset whose columns were unchecked, while the dataset with the checked columns has its data intact.

None of the other tools in Data Fusion seem useful for this, but it is unbelievable to me that it isn't possible.

I appreciate whatever help you may be able to provide!


Solution

  • If the data ingested from both sources has the same schema, we can simply connect both sources to the same sink in the pipeline. Internally, Data Fusion performs a UNION of the inputs.

    For example, the pipeline shown reads data from two different sources (Google Cloud Storage files) that have the same schema and writes to a single destination (a Google Cloud BigQuery table).

    [Image: pipeline with two Google Cloud Storage sources connected to one BigQuery sink]
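
To make the wiring concrete, here is a minimal Python sketch of the pipeline shape described above: two source stages, one sink, and two connections that both end at the sink. The stage names are hypothetical, and the plugin names and the `stages`/`connections` layout are assumptions modelled on what an exported pipeline tends to look like, so check them against an export from your own instance. The `simulate_sink` helper is only an illustrative model of the union (rows from each input are appended, with no deduplication assumed); it is not Data Fusion code.

```python
import json

# Hypothetical stage names. Plugin names and the overall layout are
# assumptions; compare with a pipeline exported from your own instance.
stages = [
    {"name": "GCS_Source_A", "plugin": {"name": "GCSFile", "type": "batchsource"}},
    {"name": "GCS_Source_B", "plugin": {"name": "GCSFile", "type": "batchsource"}},
    {"name": "BigQuery_Sink", "plugin": {"name": "BigQueryTable", "type": "batchsink"}},
]

# Both sources connect directly to the same sink; there is no Joiner stage.
connections = [
    {"from": "GCS_Source_A", "to": "BigQuery_Sink"},
    {"from": "GCS_Source_B", "to": "BigQuery_Sink"},
]


def inputs_of(stage_name, connections):
    """Return the names of all stages wired into `stage_name`."""
    return [c["from"] for c in connections if c["to"] == stage_name]


def simulate_sink(rows_by_source, sink_name, connections):
    """Illustrative model of the union: append the rows of every input."""
    combined = []
    for source in inputs_of(sink_name, connections):
        combined.extend(rows_by_source[source])
    return combined


# Made-up sample rows; both sources use the same columns.
rows_by_source = {
    "GCS_Source_A": [{"id": 1, "amount": 10.0}, {"id": 2, "amount": 5.5}],
    "GCS_Source_B": [{"id": 3, "amount": 7.25}],
}

combined = simulate_sink(rows_by_source, "BigQuery_Sink", connections)

print(json.dumps({"stages": stages, "connections": connections}, indent=2))
print(len(combined), "rows reach the sink")
```

Unlike the join attempt in the question, nothing here matches columns across the two inputs, so no null-filled rows are produced: every row from either source arrives at the sink with its own values.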