Search code examples
mulemulesoftmule4

Mule 4 Batch Process on large input file


I have a huge file that can have anywhere from few hundred thousand to 5 million records. Its tab-delimited file. I need to read the file from ftp location , transform it and finally write it in a FTP location.

I was going to use FTP connector get the repeatable stream and put it into mule batch. Inside mule batch process idea was to use a batch step to transform the records and finally in batch aggregate FTP write the file to destination in append mode 100 records at a time.

Q1. Is this a good approach or is there some better approach?

Q2. How does mule batch load and dispatch phase work (https://docs.mulesoft.com/mule-runtime/4.3/batch-processing-concept#load-and-dispatch ) Is it waiting for entire stream of millions of records to be read in memory before dispatching a mule batch instance ?

Q3. While doing FTP write in batch aggregate there is a chance that parallel threads will start appending content to FTP at same time thereby corrupting the records. Is that avoidable. I read about File locks (https://docs.mulesoft.com/ftp-connector/1.5/ftp-write#locks) . My assumption is it will simply raise File lock exception and not necessarily wait to write FTP in append mode.


Solution

  • Q1. Is this a good approach or is there some better approach?

    See answer Q3, this might not work for you. You could instead use a foreach and process the file sequentially though that will increase the time for processing significantly.

    Q2. How does mule batch load and dispatch phase work (https://docs.mulesoft.com/mule-runtime/4.3/batch-processing-concept#load-and-dispatch ) Is it waiting for entire stream of millions of records to be read in memory before dispatching a mule batch instance ?

    Batch doesn't load big numbers of records in memory, it uses file based queues. And yes, it loads all records in the queue before starting to process them.

    Q3. While doing FTP write in batch aggregate there is a chance that parallel threads will start appending content to FTP at same time thereby corrupting the records. Is that avoidable. I read about File locks (https://docs.mulesoft.com/ftp-connector/1.5/ftp-write#locks) . My assumption is it will simply raise File lock exception and not necessarily wait to write FTP in append mode

    The file write operation will throw a FILE:FILE_LOCK error if the file is already locked. Note that Mule 4 doesn't manage errors through exceptions, it uses Mule errors.

    If you are using DataWeave flatfile to parse the input file, note that it will load the file in memory and use significantly more memory than the file itself to process it, so you probably are going to get an out of memory error anyway.