Log data collected with Flume Avro is not stored properly in Hive


I'm using Flume 1.5.0 to collect logs from application servers. Say I have three app servers, App-A, App-B, and App-C, and one HDFS server where Hive is running. Flume agents run on all three app servers and forward the log messages to the HDFS server, where another Flume agent receives them and finally stores them in the Hadoop file system. I have created an external Hive table to map the log data. Everything works smoothly except that Hive is unable to parse the log data properly and store it in the table.

Here's my Flume and Hive configuration:

Dummy Log File Format (| separated): ClientId|App Request|URL

Flume conf at App servers:

app-agent.sources = tail
app-agent.channels = memoryChannel 
app-agent.sinks = avro-forward-sink 

app-agent.sources.tail.type = exec 
app-agent.sources.tail.command = tail -F /home/kuntal/practice/testing/application.log
app-agent.sources.tail.channels = memoryChannel


app-agent.channels.memoryChannel.type = memory
app-agent.channels.memoryChannel.capacity = 100000
app-agent.channels.memoryChannel.transactioncapacity = 10000

app-agent.sinks.avro-forward-sink.type = avro 
app-agent.sinks.avro-forward-sink.hostname = localhost
app-agent.sinks.avro-forward-sink.port = 10000
app-agent.sinks.avro-forward-sink.channel = memoryChannel

Flume conf at HDFS server:

hdfs-agent.sources = avro-collect
hdfs-agent.channels = memoryChannel 
hdfs-agent.sinks = hdfs-write 

hdfs-agent.sources.avro-collect.type = avro 
hdfs-agent.sources.avro-collect.bind = localhost
hdfs-agent.sources.avro-collect.port = 10000 
hdfs-agent.sources.avro-collect.channels = memoryChannel

hdfs-agent.channels.memoryChannel.type = memory
hdfs-agent.channels.memoryChannel.capacity = 100000
hdfs-agent.channels.memoryChannel.transactioncapacity = 10000

hdfs-agent.sinks.hdfs-write.channel = memoryChannel
hdfs-agent.sinks.hdfs-write.type = hdfs 
hdfs-agent.sinks.hdfs-write.hdfs.path = hdfs://localhost:9000/user/flume/tail_table/avro
hdfs-agent.sinks.hdfs-write.rollInterval = 30 

Hive external table:

CREATE EXTERNAL TABLE IF NOT EXISTS test(clientId int, itemType string)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t' LINES TERMINATED BY '\n'
LOCATION '/user/flume/tail_table/avro';

Please suggest what I should do. Do I need to include the AvroSerDe on the Hive side?


Solution

  • The following three settings were missing from the HDFS sink:

    hdfs-agent.sinks.hdfs-write.hdfs.fileType = DataStream
    hdfs-agent.sinks.hdfs-write.hdfs.writeFormat = Text
    hdfs-agent.sinks.hdfs-write.hdfs.rollInterval = 30 
    

    As a result, the data was not stored properly in HDFS and Hive was unable to load it into the table. Now it's working fine!
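
For reference, here is a sketch of how the complete HDFS sink section from the question might look with these settings applied. The agent, sink, and channel names and the path are taken from the question. The comments rely on the Flume HDFS sink defaults (`hdfs.fileType` defaults to `SequenceFile` and `hdfs.writeFormat` to `Writable`), which is presumably why a text-based Hive table could not read the files as originally written.

```properties
# HDFS sink on the collector agent (names and path as in the question)
hdfs-agent.sinks.hdfs-write.channel = memoryChannel
hdfs-agent.sinks.hdfs-write.type = hdfs
hdfs-agent.sinks.hdfs-write.hdfs.path = hdfs://localhost:9000/user/flume/tail_table/avro
# Write raw event bodies instead of the default SequenceFile container
hdfs-agent.sinks.hdfs-write.hdfs.fileType = DataStream
# Serialize each event as a line of text instead of the default Writable
hdfs-agent.sinks.hdfs-write.hdfs.writeFormat = Text
# Roll to a new file every 30 seconds; note the hdfs. prefix,
# which the rollInterval line in the question's config lacked
hdfs-agent.sinks.hdfs-write.hdfs.rollInterval = 30
```

With plain-text files in HDFS, an AvroSerDe should not be needed on the Hive side, since Avro is used here only as the transport between the Flume agents. Note also that the sample log format is `|` separated while the table declares `'\t'` as the field delimiter; if the real logs are pipe-delimited, the `FIELDS TERMINATED BY` clause would likely need to be `'|'` for the columns to split.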