Search code examples
kettlepentaho-data-integration

Join data from different sources in Kettle


My database has the following tables:

USER
userid INT
username STRING

USER_SESSION_HISTORY
userid INT (foreign key to USER)
login_date DATETIME

Then I have a CSV with this header:

username;login_date

I need to insert the CSV data into the USER_SESSION_HISTORY table. As you can see, I need to join the two data sources (the USER table and the CSV file) to be able to get the user id.

I'm new to Kettle and just got to learn very simple transformations until now.


Solution

  • You need to read the CSV and the USER table, after those two steps, you need to add a Sort step for each one (check for the case sensitive/insensitive possibilities in the Sort step), afterwards, with a merge join step you merge both streams of data, you set it up as a LEFT OUTER JOIN, on the left the CSV data (coming from the sort step), on the right the USER data (coming from the sort step), so for each username in the CSV you look if the username already exists in the USER table.

    Following the MERGE JOIN, you put a Filter step, looking if the userid data IS NULL, if it's NULL, that means that the username in the CSV doesn't exist, so you'll need to insert it first in the USER table.

    If you have the userid (filter condition false in the previous step) you can simply insert the data in the USER_SESSION_HISTORY with the userid you retrieved from the MERGE JOIN.

    For the true condition in the filter (no previous data in the user data), it's going to depend on how you generate the userid. If you have a sequence associated to the userid, and the column is automatically filled with the corresponding next value of the column, you can insert the username in the USER table and the database will take care of filling the userid. If you can't simply insert the username in the USER table, you'll have to add some intermediate steps to generate the userid depending on how you handle it in the database.

    I don't know if after inserting in the USER table you'll be able to see the userid generated, you can test it. If you have it available in this transformation, you can add a block step after the Insert step in the USER table, waiting for the Insert step in the USER_SESSION_HISTORY to finish (for the data where the username was already present in the USER table, the false condition in the filter step). The block step is needed because Pentaho runs all the steps at the same time unless you use this block step, so the USER_SESSION_HISTORY table would be blocked with two transactions executing at the same time. After the block step, you add a second Insert step for the USER_SESSION_HISTORY table.

    If you don't have available the userid available after inserting, I think the easier way to work would be with two transformations, first one to insert the new usernames in the USER table, and a second transformation to insert the data in the USER_SESSION_HISTORY, in the second transformation we are sure we already have all the usernames available in the USER table Transformation  mockup