Tags: collections, mule, flow-control, aggregator

Holding data processing for incomplete data sets with Mule and a collection-aggregator


I need to collect and process sets of files generated by another organization. For simplicity, say that a set consists of two files, a summary file and a detail file, named like SUM20150701.dat and DTL20150701.dat, which together constitute the set for date 20150701. The issue is that sets must be processed in order, and the transmission of files from an outside organization can be error prone, so a file may be missing. When this occurs, the affected set must hold, as must any later sets that arrive. As an example, at the start of the Mule process the source folder may contain SUM20150701.dat, SUM20150703.dat, and DTL20150703.dat; that is, the data set for 20150701 is incomplete while 20150703 is complete. Both data sets must hold until DTL20150701.dat arrives, after which they should be processed in order.

In this simplified form of my Mule process, a source folder is watched for files. When files are found, they are moved to an archive folder and passed to the collection-aggregator, using the date as both the sequence and correlation values. When a set is complete, it is moved to a destination folder. A lengthy timeout is used on the aggregator to make sure incomplete sets are not processed:

<file:connector name="File" autoDelete="false" streaming="false" validateConnections="true" doc:name="File">
    <file:expression-filename-parser />
</file:connector>

<file:connector name="File1" autoDelete="false" outputAppend="true" streaming="false" validateConnections="true" doc:name="File" />

<vm:connector name="VM" validateConnections="true" doc:name="VM">
    <receiver-threading-profile maxThreadsActive="1"></receiver-threading-profile>
</vm:connector>

<flow name="fileaggreFlow2" doc:name="fileaggreFlow2">
    <file:inbound-endpoint path="G:\SourceDir" moveToDirectory="G:\SourceDir\Archive" connector-ref="File1" doc:name="get-working-files"
             responseTimeout="10000" pollingFrequency="5000" fileAge="600000">
        <file:filename-regex-filter pattern="DTL(.*).dat|SUM(.*).dat" caseSensitive="false"/>
    </file:inbound-endpoint>

    <message-properties-transformer overwrite="true" doc:name="Message Properties">
        <!-- Extract the date portion of the filename (after the 3-character SUM/DTL prefix)
             and use it as both the correlation ID and the sequence value -->
        <add-message-property key="MULE_CORRELATION_ID" value="#[message.inboundProperties.originalFilename.substring(3, message.inboundProperties.originalFilename.lastIndexOf('.'))]"/>
        <add-message-property key="MULE_CORRELATION_GROUP_SIZE" value="2"/>
        <add-message-property key="MULE_CORRELATION_SEQUENCE" value="#[message.inboundProperties.originalFilename.substring(3, message.inboundProperties.originalFilename.lastIndexOf('.'))]"/>
    </message-properties-transformer>

    <vm:outbound-endpoint exchange-pattern="one-way" path="Merge" doc:name="VM" connector-ref="VM"/>
</flow>


<flow name="fileaggreFlow1" doc:name="fileaggreFlow1" processingStrategy="synchronous">
    <vm:inbound-endpoint exchange-pattern="one-way" path="Merge" doc:name="VM" connector-ref="VM"/>

    <processor-chain doc:name="Processor Chain">
        <collection-aggregator timeout="1000000" failOnTimeout="true" doc:name="Collection Aggregator"/>

        <foreach doc:name="For Each">
            <file:outbound-endpoint path="G:\DestDir1" outputPattern="#[function:datestamp:yyyyMMdd.HHmmss].#[message.inboundProperties.originalFilename]" responseTimeout="10000" connector-ref="File1" doc:name="Destination"/>
        </foreach>
    </processor-chain>
</flow>
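For reference, the MEL substring expression in the transformer above maps each filename to its date, which the aggregator then uses to group the two files of a set. The same extraction logic in plain Java (class and method names are invented for illustration) is:

    // Illustration of the correlation-key extraction performed by the
    // message-properties-transformer above (sample filenames assumed).
    public class CorrelationKeyDemo {
        static String correlationKey(String filename) {
            // Skip the 3-character SUM/DTL prefix and stop at the extension,
            // leaving the 8-digit date.
            return filename.substring(3, filename.lastIndexOf('.'));
        }

        public static void main(String[] args) {
            System.out.println(correlationKey("SUM20150701.dat")); // 20150701
            System.out.println(correlationKey("DTL20150701.dat")); // 20150701
            System.out.println(correlationKey("SUM20150703.dat")); // 20150703
        }
    }

Both files of a set yield the same key, so they land in the same correlation group, and the keys sort chronologically.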

This correctly processes sets in order when all sets are complete, and it correctly waits for an incomplete set to fill. However, it does not hold the following sets: in the example above, set 20150703 will process while 20150701 is still waiting for its DTL file.

Is there a setting or another construct which will force the collection-aggregator element to wait if there is an earlier collection which is not complete?

I am using the date part of the file name for both the correlation and sequence IDs, which does ensure that sets are processed in the order I want when all sets are complete. It does not matter if some dates are absent (as with 20150702 in this case), only that the files which do exist are processed in order and that each set must be complete.


Solution

  • In the end, I could not get the collection-aggregator to do this. To overcome it, I built a Java class which contains a Map for the SUM files and a Map for the DTL files, each keyed by the correlation ID, plus a sorted list of open keys.

    The Java class monitors for a completed set on the smallest key and signals back to the Mule flow when that set is available for processing.

    The Mule flow must be put into synchronous mode while processing the files to prevent a data race. When processing is complete, the flow signals the Java class that the set can be dropped from the list and Maps, and receives an indication back of whether the next set is ready to process. A sketch of this gating class appears after this list.

    It is not the prettiest solution, and I would have preferred not to resort to custom code, but it gets the job done.
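Since the original class was not posted, here is a minimal sketch of the idea; all names (FileSetGate, offer, acknowledge) are hypothetical. It keeps the two files of each set in a TreeMap keyed by date, so the keys are sorted, and only releases a set when it is both complete and the smallest outstanding key:

    import java.util.Map;
    import java.util.Optional;
    import java.util.TreeMap;

    // Hypothetical sketch of the gating class described above. A set is
    // released only when both files are present AND it has the smallest
    // (oldest) outstanding key, so later complete sets hold behind an
    // earlier incomplete one. Methods are synchronized because the Mule
    // flow feeding the class may run on multiple receiver threads.
    public class FileSetGate {
        // One entry per date key; slot 0 = SUM payload, slot 1 = DTL payload.
        private final TreeMap<String, Object[]> openSets = new TreeMap<>();

        /** Register an arriving file; returns the key of a set that is now
         *  ready to process, if any. */
        public synchronized Optional<String> offer(String dateKey, boolean isSummary, Object payload) {
            Object[] pair = openSets.computeIfAbsent(dateKey, k -> new Object[2]);
            pair[isSummary ? 0 : 1] = payload;
            return readySet();
        }

        /** Called by the flow after a set has been processed; drops it and
         *  reports whether the next set is already ready. */
        public synchronized Optional<String> acknowledge(String dateKey) {
            openSets.remove(dateKey);
            return readySet();
        }

        // The smallest key is ready only if both of its files have arrived.
        private Optional<String> readySet() {
            Map.Entry<String, Object[]> first = openSets.firstEntry();
            if (first != null && first.getValue()[0] != null && first.getValue()[1] != null) {
                return Optional.of(first.getKey());
            }
            return Optional.empty();
        }
    }

Under this design the flow would call offer() for each file that arrives and acknowledge() after the destination write, looping for as long as a key comes back, which keeps later complete sets queued behind an earlier incomplete one.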