I'm new to PDI and still learning about it. I'm trying to create a transformation that reads all the CSV files from one folder, checks whether each file's data is correct (i.e., there are no rows with missing values or wrongly formatted values), and then stores the data in a database.
What I have tried is:

- Text File Input, accessing the CSV files on FTP using Apache Commons VFS
- Filter Rows
- Synchronize After Merge (I used this because I also join the CSV data with data from another table)

The result from my second step is not what I want. Currently it checks the data only after all the CSVs have been read and passes everything to the next step, but what I want is to check while reading, so that only correct rows are passed to the next step. How can I do that? Any suggestions? (Brainstorming welcome.)
And if that is impossible to implement in PDI, then it's okay to read all the data and pass it to the next step, but then validate it again before inserting the data.
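To illustrate the behaviour I'm after, here is a rough sketch in Python (not PDI, just the logic I want; the `amount` column and the checks are made-up examples):

```python
import csv
import glob

def is_number(s):
    """Made-up format check: the string must parse as a float."""
    try:
        float(s)
        return True
    except ValueError:
        return False

def read_valid_rows(folder):
    """Yield only the rows that pass validation, while each file is being read."""
    for path in glob.glob(folder + "/*.csv"):
        with open(path, newline="") as f:
            for row in csv.DictReader(f):
                # made-up checks: no empty field, and 'amount' must be numeric
                if all(v.strip() for v in row.values()) and is_number(row["amount"]):
                    yield row  # only correct rows reach the next step
```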
You can only validate a file after all its data has been completely read and checked.
The clean way to do this is a job that orchestrates several transformations (one to read the directory, one to check whether the files are valid, one to load the data from the validated files), as sketched below.
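To make that control flow concrete, here is a minimal sketch in plain Python rather than in PDI; the function names and the "no empty field" validity rule are illustrative assumptions, not anything PDI-specific:

```python
import csv
import glob

def list_files(folder):
    # transformation 1: read the directory
    return glob.glob(folder + "/*.csv")

def file_is_valid(path):
    # transformation 2: a file is valid only if every row passes the check
    with open(path, newline="") as f:
        return all(all(v.strip() for v in row) for row in csv.reader(f))

def load_file(path):
    # transformation 3: load the data (stub: print instead of a real DB insert)
    print("loading", path)

def master_job(folder):
    for path in list_files(folder):
        if file_is_valid(path):
            load_file(path)
```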
Now, writing a job may seem a daunting task until you have written half a dozen of them, so you can also do it in a single transformation. In fact, it is a standard pattern for taking decisions or making computations based on indicators defined over the whole input data.
The trick is that the filename travels with each row: the Text File Input step can add it to every row (see the Additional output fields tab). A per-row indicator then flags invalid rows, a Group By on the filename counts the errors per file, and a second Filter Rows keeps only the rows of files with zero errors, as in the sketch below. Just a remark: in your specific case I would change the flow a bit, testing for the accepted filenames in the first filter and removing the Group By and the second filter. But I thought it would be more useful for you to have the standard pattern.
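For what it's worth, here is that pattern expressed in Python, assuming the only validity rule is "no missing or empty field" (the rule is a placeholder; the comments map each part to the PDI steps):

```python
import csv
import glob
from collections import defaultdict

def load_valid_files(folder):
    rows, errors_per_file = [], defaultdict(int)
    for path in glob.glob(folder + "/*.csv"):              # Text File Input
        with open(path, newline="") as f:
            for row in csv.DictReader(f):
                # per-row indicator: "invalid" means any missing/empty field
                invalid = any(v is None or not v.strip() for v in row.values())
                errors_per_file[path] += invalid           # Group By on the filename
                rows.append((path, row))
    # second filter: keep only rows whose whole file has zero errors
    return [row for path, row in rows if errors_per_file[path] == 0]
```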
But, again, for various reasons, good practice would be to do it with a master job.