hadoop, hdfs, flume, flume-ng

Flume - Is there a way to store avro event (header & body) into hdfs?


New to flume...

I'm receiving Avro events and storing them in HDFS.

I understand that by default only the body of the event is stored in HDFS. I also know there is an avro_event serializer, but I don't know what this serializer actually does. How does it affect the final output of the sink?

Also, I can't figure out how to dump the whole event into HDFS while preserving its header information. Do I need to write my own serializer?


Solution

  • As it turns out, the avro_event serializer does store both the headers and the body in the file.

    Here is how I set up my sink:

    a1.sinks.i1.type=hdfs
    a1.sinks.i1.hdfs.path=hdfs://localhost:8020/user/my-name
    a1.sinks.i1.hdfs.rollInterval=0
    a1.sinks.i1.hdfs.rollSize=1024
    a1.sinks.i1.hdfs.rollCount=0
    a1.sinks.i1.serializer=avro_event
    a1.sinks.i1.hdfs.fileType=DataStream
    
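    Only the sink is shown above. For context, a minimal end-to-end agent also needs a source and a channel; here is a sketch of what the rest might look like (the Avro source bind address and port are my assumptions, not part of the original setup):

    a1.sources=r1
    a1.channels=c1
    a1.sinks=i1
    a1.sources.r1.type=avro
    a1.sources.r1.bind=0.0.0.0
    a1.sources.r1.port=41414
    a1.sources.r1.channels=c1
    a1.channels.c1.type=memory
    a1.sinks.i1.channel=c1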

    I sent the events using the packaged avro-client agent and injected headers using the -R headerFile option (the full command is sketched below).

    Content of headerFile:

    machine=localhost
    user=myName
    
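    For reference, the avro-client invocation looks roughly like this; the host, port, and input file name are assumptions, but -F (data file) and -R (header file) are the actual flags:

    flume-ng avro-client --conf conf -H localhost -p 41414 -F input.txt -R headerFile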

    Finally, I tested the results using a simple Java app I stole from this posting:

    // Open the Flume output file in HDFS; getConf() comes from the surrounding
    // Hadoop Configured/Tool class and printWriter is an open PrintWriter.
    final FileSystem fs = FileSystem.get(getConf());
    final Path path = new Path(fs.getHomeDirectory(), "FlumeData.1446072877536");

    printWriter.write(path + "-exists: " + fs.exists(path));

    // FsInput adapts the HDFS file to Avro's SeekableInput so the
    // DataFileReader can read the Avro container written by the sink.
    final SeekableInput input = new FsInput(path, getConf());
    final DatumReader<GenericRecord> reader = new GenericDatumReader<GenericRecord>();
    final FileReader<GenericRecord> fileReader = DataFileReader.openReader(input, reader);

    // Each record carries the headers map and the body bytes.
    for (final GenericRecord datum : fileReader) {
        printWriter.write("value = " + datum);
    }

    fileReader.close();
    

    And sure enough, I see my headers for each record. Here is one line:

    value = {"headers": {"machine": "localhost", "user": "myName"}, "body": {"bytes": "set -x"}}
    
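    As far as I can tell, the avro_event serializer writes records against a fixed two-field schema roughly like the one below (headers as a string map, body as bytes), which is why both show up in the output:

    {
      "type": "record",
      "name": "Event",
      "fields": [
        {"name": "headers", "type": {"type": "map", "values": "string"}},
        {"name": "body", "type": "bytes"}
      ]
    }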

    There is one other serializer that also emits the headers: the header_and_text serializer. The resulting file is a human-readable text file. Here is a sample line:

    {machine=localhost, user=userName} set -x
    
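    Selecting it should just be a one-line change to the sink configuration shown earlier:

    a1.sinks.i1.serializer=header_and_text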

    Finally, in Apache Flume - Distributed Log Collection for Hadoop there is a mention of the header_and_text serializer, but I couldn't get that to work.