Tags: kettle, pentaho-data-integration, pdi

PDI - Read CSV Files, if missing field/data then move to the next file


I'm new to PDI and still learning about it. I'm trying to create a transformation that reads all the CSV files from one folder, checks that the data in each file is correct (no rows with missing, erroneous, or wrongly formatted values), and then stores it in a database.

What I have tried is:

  1. Use a Text File Input step to read the CSV files from FTP via Apache Commons VFS.
  2. Validate the data (checking the filename and whether the fields exist) with a Filter Rows step.
  3. Output to a PostgreSQL table using Synchronize After Merge. I used this step because I also join the CSV data with data from another table.
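Conceptually, the row-level check in my step 2 amounts to something like the following (a minimal Python sketch, not PDI; the field names and the validation rule are made up for illustration):

```python
import csv

# Hypothetical field names; the real CSV layout is not given here.
EXPECTED_FIELDS = ["id", "name", "amount"]

def is_valid_row(row):
    """Row-level check: every expected field is present and non-empty."""
    return all(row.get(field) not in (None, "") for field in EXPECTED_FIELDS)

def read_valid_rows(path):
    """Read one CSV and yield only the rows that pass the check,
    roughly what a Filter Rows step does per row."""
    with open(path, newline="") as handle:
        for row in csv.DictReader(handle):
            if is_valid_row(row):
                yield row
```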

The result of my second step is not what I want. Currently it checks only after all the CSV files have been read and passes all the data to the next step, but I want to check while reading the data so that only correct data is passed on. How can I do that? Any suggestions? (I'm open to brainstorming.)

And if that is impossible to implement in PDI, it's okay to read all the data and pass it to the next step, but then validate it again before inserting it.


Solution

  • You can only validate a file after all its data has been completely read and checked.

    The proper way to do this is a job that orchestrates several transformations (one to read the directory, one to check whether the files are valid, one to load the data of the validated files).

    Now, writing a job may seem a daunting task until you have written half a dozen of them, so you can do it in a single transformation instead. In fact, it is a standard pattern for making decisions or computations based on indicators defined over the whole input data.

    1. Get the list of files.
    2. Read them, keeping track of the filename (in the Additional output fields tab).
    3. Make the check line by line as you did.
    4. Make a summary per file so you can reject a file if it contains at least one error.
    5. Take back the main stream from step 2 and, for each row, look up whether its filename was rejected (the lookup stream is the result of the Group by).
    6. Filter out the rows with a rejected filename.
    7. Put them into Postgres (after enriching the data with the other file or tables).
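    To make the pattern concrete outside of Spoon, here is a minimal sketch of the same file-level rejection logic in plain Python (the folder path, field names, and validation rule are assumptions for illustration; the comments map to the numbered steps above):

    ```python
    import csv
    from pathlib import Path

    # Hypothetical layout and rule; adapt to the real files.
    EXPECTED_FIELDS = ["id", "name", "amount"]

    def is_valid_row(row):
        return all(row.get(field) not in (None, "") for field in EXPECTED_FIELDS)

    def load_folder(folder):
        # Steps 1-2: list the files and read them, keeping the filename on each row.
        rows = []
        for path in sorted(Path(folder).glob("*.csv")):
            with open(path, newline="") as handle:
                for row in csv.DictReader(handle):
                    row["filename"] = path.name
                    rows.append(row)

        # Steps 3-4: check line by line and summarise the errors per file.
        rejected = {row["filename"] for row in rows if not is_valid_row(row)}

        # Steps 5-6: look up the verdict for each row and filter out rejected files.
        accepted = [row for row in rows if row["filename"] not in rejected]

        # Step 7: 'accepted' is what would go on to the Postgres load.
        return accepted, rejected
    ```

    A single bad row rejects the whole file, which is exactly what the Group by / lookup / Filter rows combination achieves in the transformation.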

    Just a remark: in your specific case I would change the flow a bit, testing for the accepted filename in the first filter and removing the Group by and the second filter. But I thought it would be more useful for you to have the standard pattern.

    But, again, for various reasons, good practice would be to do it with a master job.
