Search code examples
pentahoetlkettledata-integrationpdi

Comparing filenames in PDI


I am trying to import a certain .CSV file into my database using PDI (Kettle).

Normally this would be rather easy, as you could just link up a CSV file input step with a Table output step and be good to go. However, the problem is that I don't know which file I want to import in advance, as in before executing the job/transformation in PDI.

That is because I have many files in my import folder, which all have the same format regarding their filename: KeyDate_Filename_YYYYMMDD.CSV

The idea is to have file with the newest YYYYMMDD imported for a given key date.

My theoretical approach to implement this would be:

  • Make the given key date available in PDI as a parameter (already done)
  • Read in the names of all files stored in the import folder
  • Filter said filenames for the given key date
  • Compare the YYYYMMDD of the remaining files and select the newest
  • Use selected filename as parameter in a CSV file input step (already done)
  • Import data via Table output step (already done)

Unfortunately I am fairly new to PDI and don't really have a compelling idea on how to implement the bold parts or if that approach as a whole is even viable.

Can anybody think of a way to get this done? Appreciate any feedback


Edit: Forgot to mention that I am using PDI 3.2.6.

In 4.x.x I could simple use a User Defined Java Class to get this done :/


Solution

  • There are various ways to achieve that result. Here's one:

    enter image description here

    1. Get filenames lists all files within a specific folder that match a given pattern. As ${KeyDate} is already defined as a parameter, the pattern could be

      ${KeyDate}[^]_[0-9].csv

    (you can use a simpler regex, but this one will match only filenames in that format;

    1. With a regex evaluation you get the date: ${KeyDate}_[^_]*_([0-9]*)\.csv. Remember to tick the "create fields for capture groups" checkbox;
    2. Order rows by that newly created date field.
    3. Group by (without a key field) and take first value of filename (if asc order) or last value (if desc order).

    The output of the Group by step is a single row with the most recent filename that matches your pattern.

    Now you pass it to the CSV file input, telling it to "accept filenames from previous step", and specifying which field to use (default is filename).