Search code examples
sqloutputpentahoetlkettle

Using Pentaho Kettle, can I input data to database tables and generate joins without combining datastreams beforehand?


I have found this question and answer already, and together they provide most of the answer to my problem, in format entirely intelligible to a complete novice!

I have an additional query to hopefully fill in some of the gaps in both my process and my understanding:

I have a series of XML files each containing all the information for a single person, instead of a single CSV containing multiple people, as the salient input. This information requires splitting and manipulation, so I have multiple streams in my Kettle transformation to reflect this (and will have a 'for each file' loop set up in the parent Job to handle multiple files).

To use the method outlined in the selected answer to import my processed data from a person file into a database, do I need to recombine my many processed datastreams into several streams within which all the data to be joined resides, or is there an alternate way of approaching this?

i.e. If I have datastreams A, B, C, D & E in my Kettle transformation, and within my database A is joined to B & C, and D to E, do I necessarily need to combine streams A, B & C into one stream and D & E into another?

Thanks in advance


Solution

  • Update: I have now solved this issue. To do so involved (as many problems) coming at it from a different direction, but I'm adding my solution to possibly aid a similarly unskilled Kettle user:

    I used the operator outlined in the linked question (combined insert/update) to insert/update the data into my 'outermost' tables for each file (streamed using the StAX operator), my schema being much akin to a spider's web, with a nodal 'core' table of unique information for each file, branching off into tables where I aim to store single incidences of data common to many files, e.g. country.

    The data streams in my transformation now being in the database, I join streams incrementally (e.g. place+city+country streams after processing and insertion are related to generate a location entity in a location table) producing the requisite relationships. This eventually results in producing join tables from e.g. my location table to the 'core' table, such that all the disparate data for a given XML file is appropriately connected.