I am trying to ingest messages from IBM MQ using Apache Flume, with the following configuration:
# Source definition
u.sources.s1.type=jms
u.sources.s1.initialContextFactory=ABC
u.sources.s1.connectionFactory=<my connection factory>
u.sources.s1.providerURL=ABC
u.sources.s1.destinationName=r1
u.sources.s1.destinationType=QUEUE
# Channel definition
u.channels.c1.type=file
u.channels.c1.capacity=10000000
u.channels.c1.checkpointDir=/checkpointdir
u.channels.c1.transactionCapacity=10000
u.channels.c1.dataDirs=/datadir
# Sink definition
u.sinks.r1.type=hdfs
u.sinks.r1.channel=c1
u.sinks.r1.hdfs.path=/message/%Y%m%d
u.sinks.r1.hdfs.filePrefix=e_
u.sinks.r1.hdfs.fileSuffix=.xml
u.sinks.r1.hdfs.fileType = DataStream
u.sinks.r1.hdfs.writeFormat=Text
u.sinks.r1.hdfs.useLocalTimeStamp=TRUE
The issue is that when I ingest the messages, 2 messages get merged into a single file.
For example, suppose the source sends out 3 XML messages:
<id>1</id><name>Test 1</name>
<id>2</id><name>Test 2</name>
<id>3</id><name>Test 3</name>
When I receive the same messages in HDFS, I get them in 2 XML files as below:
event_1.xml
<id>1</id><name>Test 1</name>
<id>2</id><name>Test 2</name>
event_2.xml
<id>3</id><name>Test 3</name>
The expected result is to have all 3 XML messages in 3 separate files in HDFS: event_1.xml, event_2.xml, and event_3.xml.
I solved it by adding the following roll settings to the HDFS sink:
hdfs.rollSize=0
hdfs.rollInterval=1
hdfs.rollCount=1
With these settings, each message is written to its own file instead of two messages being aggregated into one.
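For anyone copying this into an agent file, here is a minimal sketch of the same three settings written with their full property keys. It assumes the agent name `u` and sink name `r1` from the configuration in the question; adjust both if yours differ.

```properties
# Sketch only: agent "u" and sink "r1" are taken from the configuration in the question.

# Disable size-based rolling (0 turns this trigger off).
u.sinks.r1.hdfs.rollSize=0

# Roll the current file after 1 second, even if no further event arrives.
u.sinks.r1.hdfs.rollInterval=1

# Roll the current file after every single event, so each message lands in its own file.
u.sinks.r1.hdfs.rollCount=1
```

As far as I understand the HDFS sink, a file is rolled as soon as any one of these triggers fires, and the defaults are rollInterval=30 (seconds), rollSize=1024 (bytes) and rollCount=10 (events). That would explain why small messages arriving close together ended up in the same file before the change. The setting that actually separates the messages is rollCount=1; rollInterval could probably also be set to 0 to turn off time-based rolling, but I have not tested that variant.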