etl, kettle, pentaho-data-integration

Extracting behaviour from transformation


I have a couple of transformations that all perform the same steps before and after the actual ETL processing:

  • Look in certain directories to see if new files for integration have appeared (directories maintained in database)
  • Mark found files as "being used" in database (and store a couple of other file attributes)
  • Move files to a "being used" sub-directory
  • Run ETL process
  • Mark files as "finished"
  • Move files to a "finished" sub-directory

These steps are a business requirement and are 100% identical for each file-based ETL process (apart from different database tables for the list of directories to search and for the registered files). Maintaining X copies of this near-identical process, and updating all of them whenever it has to change, seems non-ideal.
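In plain Java terms, the shared part amounts to something like the sketch below (the etl_files table, its columns, and the directory names are hypothetical placeholders, just to make the cycle concrete):

    import java.nio.file.*;
    import java.sql.*;

    public class FileLifecycle {

        // Hypothetical schema: etl_files(name, status, size_bytes)
        public static void main(String[] args) throws Exception {
            Path inbox = Paths.get("/data/inbox");   // in reality the directory list comes from the database
            Path inUse = Files.createDirectories(inbox.resolve("being-used"));
            Path done  = Files.createDirectories(inbox.resolve("finished"));

            try (Connection con = DriverManager.getConnection("jdbc:postgresql://host/etl", "user", "pw");
                 DirectoryStream<Path> newFiles = Files.newDirectoryStream(inbox, "*.csv")) {

                for (Path file : newFiles) {
                    // Register the file and mark it as "being used".
                    try (PreparedStatement ps = con.prepareStatement(
                            "INSERT INTO etl_files (name, status, size_bytes) VALUES (?, 'IN_USE', ?)")) {
                        ps.setString(1, file.getFileName().toString());
                        ps.setLong(2, Files.size(file));
                        ps.executeUpdate();
                    }
                    // Move it to the "being used" sub-directory.
                    Path working = Files.move(file, inUse.resolve(file.getFileName()));

                    runEtl(working);   // the actual ETL process, different for every transformation

                    // Mark the file as finished and archive it.
                    try (PreparedStatement ps = con.prepareStatement(
                            "UPDATE etl_files SET status = 'FINISHED' WHERE name = ?")) {
                        ps.setString(1, working.getFileName().toString());
                        ps.executeUpdate();
                    }
                    Files.move(working, done.resolve(working.getFileName()));
                }
            }
        }

        private static void runEtl(Path file) { /* the injected part */ }
    }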

So, I have created a "skeleton" transformation that does all of this and has the "Run ETL process" part "injected" by being passed a variable with the location of a transformation. This way the injected transformation does not need to know about the file-handling steps, and the file-handling steps are decoupled from the actual ETL processes.
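If the skeleton were code rather than a transformation, the injection would look roughly like this minimal sketch (ETL_TRANS_PATH is my own placeholder name; the Kettle classes are from the standard PDI Java API):

    import org.pentaho.di.core.KettleEnvironment;
    import org.pentaho.di.trans.Trans;
    import org.pentaho.di.trans.TransMeta;

    public class SkeletonRunner {
        public static void main(String[] args) throws Exception {
            KettleEnvironment.init();

            // The skeleton only receives the location of the ETL transformation...
            String etlTransPath = System.getenv("ETL_TRANS_PATH");  // placeholder variable name

            // ...and executes whatever it points to, knowing nothing about its contents.
            Trans trans = new Trans(new TransMeta(etlTransPath));
            trans.execute(null);       // no command-line arguments
            trans.waitUntilFinished();

            if (trans.getErrors() > 0) {
                throw new IllegalStateException("Injected ETL failed: " + etlTransPath);
            }
        }
    }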

Since I am fairly inexperienced with PDI, I was wondering how problems like these are usually solved and whether my approach is any good.


Solution

  • Your approach conforms to best practice. Moving processed files to an archive directory is the way to go, and keeping the "In use"/"Finished" status in a database accessible from outside is quite convenient.

    Traditionally you would have defined the file name and other attributes as parameters on the transformation rather than in a database. However, storing them in a database allows you to access them from outside, and even to build a small web app just to display the status.

    You can also extract the pre- and post-processing into separate transformations and orchestrate the whole process in a job.
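    Such an orchestrating job can also be launched from outside Spoon; a minimal sketch with the Kettle API (the job file name is hypothetical):

        import org.pentaho.di.core.KettleEnvironment;
        import org.pentaho.di.job.Job;
        import org.pentaho.di.job.JobMeta;

        public class RunMasterJob {
            public static void main(String[] args) throws Exception {
                KettleEnvironment.init();
                // Pre-processing -> ETL -> post-processing, wired together in one job.
                JobMeta jobMeta = new JobMeta("file_handling_master.kjb", null);  // hypothetical file
                Job job = new Job(null, jobMeta);
                job.start();
                job.waitUntilFinished();
                if (job.getErrors() > 0) {
                    System.err.println("Job failed, see the logs");
                }
            }
        }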

    I suppose you did not forget to include the file size (number of rows) among the couple of other attributes, and that you have a naming convention that lets you re-run the process after a failure without headaches.

    And finally, I suggest you also keep track of the logs in the database. Just click anywhere, select Properties / Logging / Transformation, define a database connection and table, set a logging interval of 2 (seconds), and press the SQL button. Everything you see in the bottom Step Metrics table (Input, Output, ...) will then be kept in the database. Useful when you discover that something went wrong last week.
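    Once that is set up, "last week" is a single query away. A sketch (TRANS_LOG is whatever table name you typed in the dialog; the columns shown are part of the default transformation log layout, and the date arithmetic is PostgreSQL syntax):

        import java.sql.*;

        public class LogCheck {
            public static void main(String[] args) throws Exception {
                try (Connection con = DriverManager.getConnection("jdbc:postgresql://host/etl", "user", "pw");
                     Statement st = con.createStatement();
                     ResultSet rs = st.executeQuery(
                         "SELECT TRANSNAME, STATUS, LINES_INPUT, LINES_OUTPUT, ERRORS, LOGDATE " +
                         "FROM TRANS_LOG " +
                         "WHERE LOGDATE > now() - interval '7 days' ORDER BY LOGDATE")) {
                    // Print one line per logged transformation run.
                    while (rs.next()) {
                        System.out.printf("%s  %s  in=%d out=%d errors=%d  (%s)%n",
                                rs.getString("TRANSNAME"), rs.getString("STATUS"),
                                rs.getLong("LINES_INPUT"), rs.getLong("LINES_OUTPUT"),
                                rs.getLong("ERRORS"), rs.getTimestamp("LOGDATE"));
                    }
                }
            }
        }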