
How do I access the full dataset in a Flume-to-Kafka pipeline?


I am reading a text file, SMSSpamCollection, with a Flume exec source and publishing it to a Kafka topic through a Flume Kafka sink.

     # Agent Name:
     a1.sources = r1
     a1.sinks = sample
     a1.channels = sample-channel


     # Source configuration:
     a1.sources.r1.type = exec
     a1.sources.r1.command = tail -f /Users/val/Documents/code/spark/m11_to_Upload/SMSSpamCollection
     a1.sources.r1.logStdErr = true

     # Sink type
     #a1.sinks.sample.type = logger

     # Buffers events in memory to channel
     a1.channels.sample-channel.type = memory
     a1.channels.sample-channel.capacity = 1000
     a1.channels.sample-channel.transactionCapacity = 100

     # Bind the source and sink to the channel
     a1.sources.r1.channels.selector.type = replicating
     a1.sources.r1.channels = sample-channel

     # Kafka sink settings: topic, broker list, and the channel the sink reads from
     a1.sinks.sample.type = org.apache.flume.sink.kafka.KafkaSink
     a1.sinks.sample.topic = sample_topic
     a1.sinks.sample.brokerList = 127.0.0.1:9092
     a1.sinks.sample.requiredAcks = 1
     a1.sinks.sample.batchSize = 20
     a1.sinks.sample.channel = sample-channel

I start the agent with this command:

    flume-ng agent --conf conf --conf-file /usr/local/Cellar/flume/1.9.0/libexec/conf/flume-sample.conf  -Dflume.root.logger=DEBUG,console --name a1 -Xmx512m -Xms256m 

When I read from the Kafka topic with

    kafka-console-consumer --topic sample_topic --from-beginning --bootstrap-server localhost:9092

I see only the last 10 records of the original file:

    ham Ok lor... Sony ericsson salesman... I ask shuhui then she say quite gd 2 use so i considering...
    ham Ard 6 like dat lor.
    ham Why don't you wait 'til at least wednesday to see if you get your .
    ham Huh y lei...
    spam    REMINDER FROM O2: To get 2.50 pounds free call credit and details of great offers pls reply 2 this text with your valid name, house no and postcode
    spam    This is the 2nd time we have tried 2 contact u. U have won the £750 Pound prize. 2 claim is easy, call 087187272008 NOW1! Only 10p per minute. BT-national-rate.
    ham Will ü b going to esplanade fr home?
    ham Pity, * was in mood for that. So...any other suggestions?
    ham The guy did some bitching but I acted like i'd be interested in buying something else next week and he gave it to us for free
    ham Rofl. Its true to its name

What is the proper way to see all the records?


Solution

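  • You're using tail, which by default prints only the last 10 lines of a file. With -f it then follows the file, so Flume only ever receives those 10 lines plus anything appended later.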

    Instead use:

    a1.sources.r1.command = tail -c +0 -f /Users/val/Documents/code/spark/m11_to_Upload/SMSSpamCollection
    

    The -c +0 tells tail to start output at the first byte of the file, so the exec source emits every existing line before it starts following new ones.

    BTW, an alternative is to use Kafka Connect with a connector such as the Spooldir or File Pulse plugin.
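
To check the tail behaviour described above without starting Flume, here is a minimal sketch that compares the two invocations on a throwaway file. The 25-line sample file is an assumption made for illustration (it is not the SMSSpamCollection file), and -f is left out so the commands exit instead of following the file.

```shell
# Create a temporary 25-line sample file (illustrative stand-in for the real data file).
sample="$(mktemp)"
seq 1 25 > "$sample"

# Default tail: only the last 10 lines are printed.
default_count="$(tail "$sample" | wc -l | tr -d ' ')"

# tail -c +0: output starts at the first byte, so every line is printed.
full_count="$(tail -c +0 "$sample" | wc -l | tr -d ' ')"

echo "default=$default_count full=$full_count"

# Clean up the temporary file.
rm -f "$sample"
```

With GNU or BSD tail, the first count should be 10 and the second 25, which matches the 10 records seen in the topic. Adding -f back in the Flume config keeps the source running after the existing content has been read.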