Tags: etl, kettle, pentaho-spoon, pentaho-data-integration, spoon

How to download a CSV from a HTTPS URL to file using Pentaho Data Integration - Spoon (Kettle)?


Googling this question shows that it has been asked, and partially (and poorly) answered, a number of times, mostly for older versions.

Question: How can I download a CSV to a local file, with the below constraints? I'm designing in Spoon.

URL: Will always be the same: https://example.com/data/my.csv. The website prepares the CSV and returns it to the web client as a file download after about 4-5 seconds. In a browser this means it is downloaded as a .csv file, not displayed.

Authentication: The website does not require authentication for access. The data isn't sensitive.

Local file path: The downloaded CSV will overwrite the existing CSV, e.g. d:\data\my.csv. That is, I can set this on a timer and have it download the newest CSV every hour or so.

Proxy: It is quite likely I will need to traverse a network proxy, e.g. badproxy.mynetwork.internal:8080, and that proxy requires a username and password. It would be far better if I could set this password in a single location so that anything created in future can reference it. I'm not really sure how to approach this either.

The rest of my process focuses on addressing the content of the csv, and already works fine.

The approaches I've found on Google use the HTTP client step, but it isn't clear how that translates into a file being saved locally to a known location.

Thanks for any pointers.

PDI v9.0.0.0-423


Solution

  • The HTTP client step only runs when it receives an incoming row, so it needs to be triggered. Use a Row generator step that generates e.g. 1 empty row and link it with a hop to the HTTP client step. For your case, try this: Data Grid --> HTTP Client --> CSV File Input --> Text file output (with the extension set to csv)
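
For comparison outside Spoon, here is a minimal Python sketch of the same fetch-then-write flow: request the URL through an authenticated proxy and overwrite the local file with the response body. It is not part of PDI. The function names `build_proxy_url` and `download_csv` are hypothetical, the proxy host and port are the placeholders from the question, and it assumes the proxy accepts Basic authentication.

```python
import shutil
import urllib.request
from typing import Optional
from urllib.parse import quote


def build_proxy_url(host: str, port: int, user: str, password: str) -> str:
    """Build http://user:password@host:port with the credentials percent-encoded."""
    return f"http://{quote(user, safe='')}:{quote(password, safe='')}@{host}:{port}"


def download_csv(url: str, target_path: str,
                 proxy_url: Optional[str] = None, timeout: float = 60.0) -> None:
    """Download url and overwrite target_path with the response body."""
    handlers = []
    if proxy_url:
        # Route both http and https requests through the given proxy.
        handlers.append(
            urllib.request.ProxyHandler({"http": proxy_url, "https": proxy_url})
        )
    opener = urllib.request.build_opener(*handlers)
    # Opening the target in "wb" mode truncates any existing file first.
    with opener.open(url, timeout=timeout) as response, open(target_path, "wb") as out:
        shutil.copyfileobj(response, out)
```

A call might look like `download_csv("https://example.com/data/my.csv", r"d:\data\my.csv", build_proxy_url("badproxy.mynetwork.internal", 8080, user, password))`, with `user` and `password` read from one shared place such as environment variables, so the credentials are defined only once. The generous timeout allows for the 4-5 seconds the site takes to prepare the file.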