I have a couple of transformations that all do similar functionality before and after the actual ETL processing:
These steps are a business requirement and are 100% identical (apart from different database tables for the list of directories to search in and files registered) for each ETL process based on file input. Maintaining X copies of this near identical process (and updating all of them if it has to be changed seems non-ideal)
So, I have created a "skeleton" transformation that does all this and has the "Run ETL process" part "injected" by getting passed a variable to the location of a transformation. This way the "injected" transformation does not need to know of the file handling steps and the file handling steps are decoupled from actual ETL processes.
Since I am fairly unexperienced with PDI I was wondering how problems like these are usually solved and if my approach is any good.
Your approach conforms to best practice. Moving processed files in an archive directory is the way to go, and keeping the status "In use"/"Finished" in a database accessible from outside is quite convenient.
Traditionally you would have defined the file name and other attributes in parameters on the transformation rather than in a database. However, to store them in a database allows you to access it from outside and even to make a small web app just to display the status.
You can also extract the pre- and post-processing ETL in separate transformation and orchestrate the whole process in a job.
I suppose you did not forgot to include the file size (number of rows) among the couple of other attributes. And that you have a naming convention to help you re-run the process in case of failure without headache.
And finally, I suggest you also keep track of the logs in the database. Just click anywhere, select Properties
/Parameters
/Transformation
, define a database connection
+table
, an logging interval
of 2 (seconds), and press the SQL
button.
All what you see in the bottom Step metrics
table (Input, Output,...) will be kept in the database. Useful when you discover that something went wrong last week.