
Get data with limits


I am very new to Kettle transformations, and I have run into a problem in a project I am working on.

A GET transformation calls a server, and the server returns the data in JSON format. The problem is that the response is very large, say 80,000 JSON documents, so the server sometimes goes down.

I wonder if I can limit the number of JSON documents fetched in the transformation itself. In other words, I want to get 3000 JSON documents, and after that the next 3000.

Is there a way to do it with transformations? Here is how I get the data: [image]

I am trying with

&limit=3000

in the URL I call, but I only get the first 3000 documents. I need to get 3000 documents, work with them, then get the next 3000, and so on.


Solution

  • Not in the PDI step itself, unless you can specify limit and offset parameters in the URL. These parameters have to be defined on the server that provides the data. API developers usually implement them because they know that some users would otherwise download tons of data. Unfortunately, this is a best practice and not a standard, so it could not be built into the Data Integrator.

    Give it a try. For that, use the Parameters tab rather than ?limit=&offset= in the URL. That way the values can come from a previous step, and you will be able to read from the server in chunks.

    You may also increase the Response time, which is the maximum time PDI will wait for a response from the server before deciding that the server is down.

    You may also catch the error of the REST Client step, either in a main job or by selecting error handling when you release the mouse while drawing the hop out of the step. In that case you may have to add some extra logic to restart the process 15 minutes later when the HTTP call fails. If you choose this solution, take care to stop after 3 or 5 attempts, otherwise you may fill the memory with idle processes.
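
To make the logic above concrete, here is a rough sketch in Python (not PDI) of reading in chunks with a bounded number of retries. It assumes the server honors `limit` and `offset`, which you have to check for your API. The `fetch` callable and all function names are hypothetical: `fetch` stands in for whatever performs the GET (in PDI, the REST Client step fed with `limit` and `offset` from a previous step).

```python
import time
from typing import Callable, Iterator, List

# Hypothetical signature: fetch(limit, offset) -> list of JSON documents.
Fetch = Callable[[int, int], List[object]]


def fetch_with_retry(
    fetch: Fetch,
    limit: int,
    offset: int,
    max_attempts: int = 3,
    wait_seconds: float = 15 * 60,
    sleep: Callable[[float], None] = time.sleep,
) -> List[object]:
    """Call fetch, waiting and retrying on IOError, and give up after max_attempts."""
    for attempt in range(1, max_attempts + 1):
        try:
            return fetch(limit, offset)
        except IOError:
            if attempt == max_attempts:
                raise  # stop here instead of retrying forever
            sleep(wait_seconds)
    raise ValueError("max_attempts must be at least 1")


def read_in_chunks(
    fetch: Fetch,
    limit: int = 3000,
    max_attempts: int = 3,
    wait_seconds: float = 15 * 60,
    sleep: Callable[[float], None] = time.sleep,
) -> Iterator[List[object]]:
    """Yield one chunk of at most `limit` documents at a time.

    Assumes the server returns fewer than `limit` documents (possibly none)
    once the data is exhausted.
    """
    offset = 0
    while True:
        docs = fetch_with_retry(fetch, limit, offset, max_attempts, wait_seconds, sleep)
        if docs:
            yield docs
        if len(docs) < limit:
            return
        offset += limit
```

Each chunk can then be processed before the next request is sent, for example `for docs in read_in_chunks(my_fetch): process(docs)`, where `my_fetch` and `process` are your own functions. In PDI the equivalent would be a job that loops over a transformation, passing the current offset as a parameter.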