Search code examples
kettlepdi

Kettle PDI: better lookup and insert update or insert update + lookup


In Kettle, aka Pentaho Data Integration, I read an xls with some products linked to some categories and I insert them in a db.

The relationship category-product is 1:n (one category has more products, one product is of one category). I do the insert of category, then the insert of the product.

CASE 1:

  1. Insert/update category (really, i do insert only);
  2. Lookup category by code and return the id used in the other steps;

CASE 2:

  1. Lookup category by code;
  2. Filter row: if(id>0) then go to other steps; else go to step 3;
  3. insert category and return id;

Is better (faster/memory use) the case 1 or the case 2?

The same choose is applied to sub-category, supplyer and other related entities.

Actually I use case 1 and pdi process 4 record per second and I have files with 100k records.


Solution

  • I suggest to use the 2nd method, reading the products and for each product use the Lookup steam to find the one category of the product.

    The reason to go that way is that it is the way a human thinks. And also that the. Lookup stream (not the Lookup Database) is pretty optimized. In some case even quicker than a database left join.