PDI - Multiple file input based on date in filename


I'm working on a project that uses Kettle (PDI). I have to read multiple .csv or .xls files and insert their contents into a database.

The file names follow the pattern AAMMDDBBBB, where AA is a city code, BBBB is a shop code, and MMDD is the date as month and day. For example: LA0326F5CA.csv.

The regexps I use in the input file steps look like LA.*\.csv or DT.*\.xls, which return all the files, so all of them get inserted into the DB.

Can you tell me how to select only yesterday's files (based on the MMDD part of the file name)?


Solution

  • As you need some "complex" logic in your selection, you cannot filter with a regexp alone. I suggest you first read all the filenames, then filter them by "age", then read only the files whose names were selected.

    In detail:

    1. Use the Get File Names step with the same regexp you currently use (LA.*\.csv or DT.*\.xls). You may be more restrictive at that stage with a regexp like LA\d\d\d\d.{4}\.csv, to ensure MM and DD are digits and BBBB is exactly 4 characters.

    2. Filter based on the date. You can do this with a Java Filter, but it is much easier to use a JavaScript step to compute the "age" of your file and then a Filter rows step to keep only yesterday's files.

      To compute the age of the file, extract MM and DD from the filename. One way to do it (other methods are available):

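      var age = null; // stays null when the filename does not match
      var regexp = filename.match(/..(\d\d)(\d\d).*/);
      if (regexp) {
          var now = new Date();
          var today = new Date(now.getFullYear(), now.getMonth(), now.getDate());
          // Months are 0-based in JavaScript; the filename has no year, so the current one is assumed.
          var fileDate = new Date(now.getFullYear(), regexp[1] - 1, regexp[2]);
          age = Math.round((today - fileDate) / 1000 / 60 / 60 / 24); // whole days
      }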

    If you are not familiar with JavaScript regexps: match tests the filename against the regexp and returns the values captured by the parentheses in an array. If the test succeeds (which you must check explicitly to avoid a runtime failure), use the captured values to build the corresponding date, and subtract it from today's date to get the age. The difference between two dates is in milliseconds, so it is then converted to days.

    3. Use the Text File Input and Excel Input steps with the option Accept filenames from previous step. Note that CSV Input does not have this option, but the more powerful Text File Input does.
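
    The selection logic of steps 1 and 2 can also be sketched as plain JavaScript, outside of PDI, which makes it easy to test. This is only an illustration: the names fileAgeInDays and selectYesterday are hypothetical, the pattern assumes two uppercase letters for the city code, and because the filename carries no year, the sketch assumes that a date later than today belongs to the previous year.

```javascript
// Hypothetical helper: age in whole days of a file named AAMMDDBBBB.ext,
// relative to `today`. Returns null when the name does not fit the pattern.
function fileAgeInDays(filename, today) {
    var m = /^[A-Z]{2}(\d\d)(\d\d).{4}\.(csv|xls)$/.exec(filename);
    if (!m) {
        return null;
    }
    var month = parseInt(m[1], 10);
    var day = parseInt(m[2], 10);
    // Compare midnight to midnight so the time of day does not matter.
    var midnight = new Date(today.getFullYear(), today.getMonth(), today.getDate());
    // Months are 0-based in the JavaScript Date constructor.
    var fileDate = new Date(midnight.getFullYear(), month - 1, day);
    // Reject impossible dates such as 0231, which Date would roll over to March.
    if (fileDate.getMonth() !== month - 1 || fileDate.getDate() !== day) {
        return null;
    }
    // Assumption: a date later than today belongs to the previous year.
    if (fileDate > midnight) {
        fileDate = new Date(midnight.getFullYear() - 1, month - 1, day);
    }
    // Math.round absorbs the one-hour shift around daylight-saving changes.
    return Math.round((midnight - fileDate) / (24 * 60 * 60 * 1000));
}

// Keep only the filenames whose embedded date is exactly one day old.
function selectYesterday(filenames, today) {
    return filenames.filter(function (name) {
        return fileAgeInDays(name, today) === 1;
    });
}
```

    For instance, calling selectYesterday with the names LA0326F5CA.csv and LA0325F5CA.csv and a `today` of 27 March keeps only LA0326F5CA.csv. The previous-year rule is there so that a file dated 1231 is still picked up on 1 January.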