In Kettle, aka Pentaho Data Integration, I read an xls with some products linked to some categories and I insert them in a db.
The relationship category-product is 1:n (one category has more products, one product is of one category). I do the insert of category, then the insert of the product.
CASE 1:
CASE 2:
Is better (faster/memory use) the case 1 or the case 2?
The same choose is applied to sub-category, supplyer and other related entities.
Actually I use case 1 and pdi process 4 record per second and I have files with 100k records.
I suggest to use the 2nd method, reading the products and for each product use the Lookup steam
to find the one category of the product.
The reason to go that way is that it is the way a human thinks. And also that the. Lookup stream
(not the Lookup Database
) is pretty optimized. In some case even quicker than a database left join.