Is it better to put tSortRow before tUniqRow or vice versa for the best performance? And how can I optimize tUniqRow? Even with the "use disk" option, the job crashes. I'm working on a file with 3 million lines.
In order to optimize your job, you can try the following:
Use the "use disk" option on tSortRow with a smaller buffer. The default buffer of 1 million rows is too big, so start with a small number of rows, 50k for instance, then increase it to get better performance. A smaller buffer means more (smaller) files on disk, so your job will run slower, but it will consume less memory.
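To illustrate why the buffer size trades speed for memory, here is a minimal sketch of a disk-based sort in plain Java. It is not Talend's actual implementation; the class `ExternalSortSketch`, the method `sort`, and the parameter `bufferRows` are hypothetical names, and rows are assumed to be single-line strings compared lexicographically.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.Iterator;
import java.util.List;
import java.util.PriorityQueue;
import java.util.function.Consumer;

public class ExternalSortSketch {
    /** One open chunk file plus its next unread line. */
    private static final class Run {
        final BufferedReader reader;
        String head;
        Run(BufferedReader reader) throws IOException {
            this.reader = reader;
            this.head = reader.readLine();
        }
    }

    /**
     * Sorts rows while holding at most bufferRows of them in memory.
     * Returns the number of temporary chunk files that were written.
     */
    static int sort(Iterator<String> rows, int bufferRows, Consumer<String> sink)
            throws IOException {
        List<Path> chunks = new ArrayList<>();
        List<String> buffer = new ArrayList<>();
        while (rows.hasNext()) {
            buffer.add(rows.next());
            // Flush when the buffer is full or the input is exhausted.
            if (buffer.size() == bufferRows || !rows.hasNext()) {
                Collections.sort(buffer);
                Path chunk = Files.createTempFile("sort-chunk", ".txt");
                Files.write(chunk, buffer);
                chunks.add(chunk);
                buffer.clear();
            }
        }
        // Merge phase: only one line per chunk is held in memory.
        PriorityQueue<Run> heap =
                new PriorityQueue<>((a, b) -> a.head.compareTo(b.head));
        for (Path chunk : chunks) {
            Run run = new Run(Files.newBufferedReader(chunk));
            if (run.head != null) heap.add(run);
        }
        while (!heap.isEmpty()) {
            Run run = heap.poll();
            sink.accept(run.head);
            run.head = run.reader.readLine();
            if (run.head != null) heap.add(run);
            else run.reader.close();
        }
        for (Path chunk : chunks) Files.delete(chunk);
        return chunks.size();
    }

    public static void main(String[] args) throws IOException {
        List<String> input = Arrays.asList("d", "b", "a", "c", "e");
        int files = sort(input.iterator(), 2, System.out::println);
        System.out.println("chunk files used: " + files);
    }
}
```

In this sketch, halving `bufferRows` roughly doubles the number of chunk files to write and merge, which is the slowdown described above, while the peak number of rows held in memory drops accordingly.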
Try a tSortRow (using disk) followed by a tAggregateSortedRow instead of tUniqRow. If you specify the unique columns in the Group By section, it acts as a tUniqRow; the columns that are not part of the unique key must be specified in the Operations tab, each using the 'First' function. Because it expects the rows to be sorted already, it doesn't sort them in memory first. Note that this component requires you to know the number of rows in your flow beforehand, which you can get from a previous subjob if you're processing your data in multiple steps.
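The reason sorted input helps can be sketched in plain Java: when equal keys are adjacent, deduplication only needs to remember the previous key. This is an illustration of the idea, not Talend's generated code; `SortedDedup`, `firstPerKey`, and `keyCols` are hypothetical names, and rows are assumed to be string arrays already sorted by the key columns.

```java
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;
import java.util.function.Consumer;

public class SortedDedup {
    /**
     * Emits the first row of each run of equal keys.
     * Assumes the rows arrive sorted by the columns listed in keyCols.
     */
    static void firstPerKey(Iterator<String[]> sortedRows, int[] keyCols,
                            Consumer<String[]> sink) {
        String[] prevKey = null;
        while (sortedRows.hasNext()) {
            String[] row = sortedRows.next();
            // Extract the key columns of the current row.
            String[] key = new String[keyCols.length];
            for (int i = 0; i < keyCols.length; i++) {
                key[i] = row[keyCols[i]];
            }
            // A new key starts a new group; keep its first row only.
            if (prevKey == null || !Arrays.equals(prevKey, key)) {
                sink.accept(row);
                prevKey = key;
            }
        }
    }

    public static void main(String[] args) {
        List<String[]> rows = Arrays.asList(
                new String[]{"1", "a"},
                new String[]{"1", "b"},
                new String[]{"2", "c"});
        firstPerKey(rows.iterator(), new int[]{0},
                r -> System.out.println(String.join(",", r)));
    }
}
```

Only one key is kept in memory at a time, regardless of how many rows pass through, which is why this approach scales where an in-memory set of all seen keys would not.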
Also, if the columns you're sorting by in tSortRow come from your database table, you can use an ORDER BY clause in your tOracleInput. This way the sorting is done on the database side and your job won't consume memory for the sort.