
How to get all results from the result set of a Table Input step in Pentaho Kettle?


I have a simple transformation consisting of 2 steps. Step 1 (Table Input) queries the database and step 2 (Java class) processes the results. Step 2 takes a long time (which is normal in my case), but after about 1 hour I get a closed-result-set error:

Server has closed the connection. If result set contain huge amount of data, Server expects client to read off the result set relatively fast. In this case, please consider increasing net_wait_timeout session variable. / processing your result set faster (check Streaming result sets documentation for more information) 2017/10/02 13:12:06 - Getting of data cells .0 -
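As a stopgap on the connection side, the timeout this error hints at can be raised for the session; in PDI you can put such a statement in the database connection's Advanced tab (the "SQL to execute after connecting" box). This is only a sketch, assuming a MySQL/MariaDB server; the exact variable name (`net_write_timeout` below) should be checked against your server version:

```sql
-- Stopgap sketch: give the server more patience with a slow consumer.
-- Runs once per session, e.g. from the PDI connection's after-connect SQL box.
-- Variable name assumed for MySQL/MariaDB; 28800 s = 8 h.
SET SESSION net_write_timeout = 28800;
```

This only buys time; it does not fix the underlying problem the answer below describes.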

I think there should be some intermediate step (or some other option) that reads all the results from step 1 relatively fast. Could you help me with that?


Solution

  • I guess your step 2 is locking the same table as the one read in step 1.

    That's one of the drawbacks of the otherwise efficient architecture of PDI: all the steps start up at the same time, and the quickest to produce results hands its rows to the next step. With this "do the quickest first" strategy, you sometimes beat the SQL optimizer itself when there are lots of joins on sums or averages (pro rata).

    The main pitfall in this respect is to read a table, apply some transformation, and write the result back to the same table with the "Truncate table" option checked. In that case the truncate is executed a few milliseconds before the select of the Table Input step, which starts an infinite deadlock. After a long time you decide to kill the ETL, but by then the data has been lost.

    Solutions:

    • The best practice is to rewrite step 2 using PDI steps rather than a ready-made Java class. That is the approach I strongly recommend in the long run, but you may have some reason not to follow it.

    • If your table is small, you can put a Blocking Step between the input and the output.

    • If your table is big, you can use a Sort rows step instead of the blocking step. You do not really want to sort, but PDI needs to see the last row to be sure the sort is complete before it passes results to the next step. The sort will cut the data into temporary chunks on the hard disk, and you have some control over where and how that tmp data is stored.

    • You can copy your table into a tmp table (or file), process it, and delete it afterwards. Use a job to do that, because in a job, unlike in a transformation, the entries run sequentially.
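The last option can be sketched as three sequential job entries. The table names here (`sales` as the source, `tmp_sales` as the snapshot) are only for illustration:

```sql
-- Job entry 1 (SQL): snapshot the source, so the slow transformation
-- streams from a copy that nothing else locks or truncates
CREATE TABLE tmp_sales AS SELECT * FROM sales;

-- Job entry 2: run the transformation, with its Table Input reading
--   SELECT * FROM tmp_sales;

-- Job entry 3 (SQL): clean up once the transformation has succeeded
DROP TABLE tmp_sales;
```

Because a job runs its entries one after another, the snapshot is guaranteed to be complete before the transformation starts, and the cleanup only happens after it finishes.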