I have a Pentaho job, and in one of its transformations I want to get the number of files in a folder. I have tried two different approaches, but both took over 2 minutes to execute. I would like to know if there is a step that could be used to do this more efficiently.
Approach 1 - Get File rows count -> Set Variables
In my Get File rows count step I have the directory and a wildcard (.*.xml) to get the count of XML files in a folder. In the Content tab I have the file count saved to a field (fileCount), which is then saved to a variable. For a folder with 3,722 XML files it took 2:15 to run.
Approach 2 - Get File Names -> Group By -> Set Variables
With this approach I use settings similar to those in the 'Get File rows count' step, followed by a Group By step with the type 'Number of rows (without field argument)'. This method ran in 2:30 for the same 3,722 files.
I think these are taking so long because they are trying to load the files into memory, but I only care about the count. I was hoping there is a way to get just the count.
The Get Files Rows Count step counts every line in every file, so it is no wonder it's slow.
Use the Get File Names step instead; it should be very quick, regardless of file sizes. Luckily I had a folder full of XML files ready, so here is a screenshot of what to expect (in a Linux VM on my laptop).
If you are using this step and still having issues, first make sure you have removed the other input steps from the transformation, as they will still be running and possibly interfering. Second, check whether antivirus software is scanning every file when Spoon accesses them for the metadata.
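As a sanity check outside Pentaho, it can help to time a plain directory listing on the same folder. If that is also slow, the bottleneck is probably the filesystem or a scanner rather than the transformation. The following is only a rough sketch in Python, not anything Pentaho runs internally; the function name `count_matching_files` and the escaped pattern `.*\.xml` are my own assumptions. It counts entries by name only and never opens a file.

```python
import os
import re


def count_matching_files(directory, pattern=r".*\.xml"):
    """Count regular files in `directory` whose names fully match `pattern`.

    Only directory entries are inspected; file contents are never read.
    The default pattern escapes the dot, unlike the wildcard `.*.xml`,
    so a name such as `axml` would not be counted.
    """
    regex = re.compile(pattern)
    with os.scandir(directory) as entries:
        # is_file() skips subdirectories; fullmatch() requires the whole name to match.
        return sum(
            1 for entry in entries
            if entry.is_file() and regex.fullmatch(entry.name)
        )
```

Calling `count_matching_files("/path/to/folder")` (a placeholder path) on a folder with a few thousand files should normally return almost immediately on a local disk, which gives a baseline to compare the Get File Names timing against.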