Tags: pentaho, data-warehouse, pentaho-spoon, pentaho-data-integration, data-integration

Reading data to a Table input in Pentaho


I have a question; maybe someone can help me. I have a transformation as shown below: [screenshot of the transformation]

In Table_input I read data from an MSSQL database, and in Table_input_2 I read data from a DB2 database. In Merge join I compare the rows from both tables on ID and on the rest of the attributes, and in Filter rows I keep the records where the same ID has differing attribute values. For now, I write the differences to an xlsx file.

Table_input and Table_input_2 each return many records, over 100 million. The differences often result simply from the query execution times on the MSSQL and DB2 databases. I would like the process to be run again for the returned differences, so I can check whether they were only caused by the time at which each query completed. Can anyone give me a hint how to do it?


Solution

  • From your short description, you have two source databases that should be in sync, but aren't always. You are using this Pentaho job to find the differences, but when running it you find false positives: differences that disappear after a short time because the systems catch up again.

    If you want to find differences that are persistent, your options depend on the circumstances of your operational systems.

    If you're lucky, you have (or can get) rowversion columns on the MSSQL tables. These get updated whenever a row is changed. If the replication to DB2 includes them, you know exactly whether you're comparing the same version of the row.
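
    For illustration, exposing such a column on the MSSQL side could look roughly like this (dbo.MyTable, row_ver and the other column names are placeholders for this sketch, not something from your setup):

    -- Hypothetical sketch: add a rowversion column and include it in the extract.
    ALTER TABLE dbo.MyTable ADD row_ver rowversion;

    -- Table input would then select it along with the data columns,
    -- so both sides can be compared on the same version of each row.
    SELECT ID, col_a, col_b, row_ver
    FROM dbo.MyTable
    ORDER BY ID;

    SQL Server maintains rowversion values automatically on every insert and update, so nothing in the application has to populate the column.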

    However, you likely wouldn't be asking here if you already had the above in place, so for now I'll assume you have to make do without reliable timestamps. There are probably also no quiet hours that see very few changes.

    In your situation, I would process "small" rowsets of 10k, 100k or 1M rows at a time in the transformation and use a parent job to increase the starting ID on each run. This reduces query times and, hopefully, the number of false differences.

    You can see a complete example in one of my earlier answers here. It's slightly different because that one was for an offset/limit with unknown total records.

    The idea is that your transformation takes parameters telling it at which ID to start and how many rows to take. It processes those rows and reports back the number of records processed. If the last query returned close to 0 records, you're done; due to real-time changes it might never reach exactly 0.

    The parent job checks whether you're done and, if not, sets the new START_ID and runs the transformation again.

    Your Table inputs would change like this for DB2:

    SELECT ...
    FROM TABLE
    WHERE ID >= ${START_ID} 
    ORDER BY ID
    FETCH FIRST ${BATCH_SIZE} ROWS ONLY -- standard DB2 syntax; LIMIT ${BATCH_SIZE} also works on versions/compatibility modes that accept it
    

    and MSSQL (it's been a long time, may not be correct):

    SELECT TOP ${BATCH_SIZE} ...
    FROM TABLE
    WHERE ID >= ${START_ID}
    ORDER BY ID
    

    If the ID is numerical, you can probably drop the Sort rows steps in the transformation, saving a lot of time, because the ORDER BY from the databases will already match the sort order the merge expects. If it's alphanumeric, keep them, since the database collations don't necessarily match Kettle's sort order.

    The Merge rows (diff) step does what you currently do with Merge join and Filter rows combined, but it needs an identical field layout on both inputs. It might be an option.

    Finally, I would add a Get System Info step for a timestamp, then store any differing IDs in a table together with the timestamp at which you detected them, and report only on differences that persist for more than X minutes/hours/runs.
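
    As a rough sketch of that last idea (the diff_log table, its columns and the threshold of 3 runs below are placeholders, and the exact timestamp handling depends on the database you log into):

    -- Hypothetical log table: each run inserts one row per ID that still differs.
    CREATE TABLE diff_log (
        id          BIGINT    NOT NULL,
        detected_at TIMESTAMP NOT NULL
    );

    -- Report only IDs that differed in at least 3 separate runs,
    -- i.e. persistent differences rather than transient replication lag.
    SELECT id,
           MIN(detected_at) AS first_seen,
           MAX(detected_at) AS last_seen
    FROM diff_log
    GROUP BY id
    HAVING COUNT(*) >= 3;

    A Table output step can handle the inserts, with the Get System Info timestamp mapped to detected_at.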