hadoop, hive, flume

Streaming live data from HDFS to Hive


I am new to the Hadoop ecosystem and am teaching myself through online articles. I am working on a very basic project so that I can get hands-on experience with what I have learnt.

My use-case is extremely simple: I want to show the app admin the location of each user who logs in to the portal. So, I have a server which continuously generates logs; the logs contain a user ID, an IP address and a time-stamp, and all fields are comma-separated.

My idea is to have a Flume agent stream the live log data and write it to HDFS, have a Hive process in place which reads the incremental data from HDFS and writes it to a Hive table, then use Sqoop to continuously copy data from Hive to an RDBMS SQL table, and use that SQL table to play with.

So far I have successfully configured a Flume agent which reads logs from a given location and writes them to an HDFS location. But after this I am confused about how I should move the data from HDFS to a Hive table. One idea that comes to mind is to have a MapReduce program that reads the files in HDFS and writes to Hive tables programmatically in Java. But I also want to delete files which are already processed and make sure that no duplicate records are read by MapReduce. I searched online and found a command that can be used to copy file data into Hive, but that is a manual, one-off activity. In my use-case I want to push data as soon as it is available in HDFS. Please guide me on how to achieve this task. Links will be helpful.

I am working on Cloudera Express 5.13.0.


Update 1: I just created an external Hive table pointing to the HDFS location where Flume is dumping the logs. I noticed that as soon as the table is created, I can query the Hive table and fetch the data. This is awesome. But what will happen if I stop the Flume agent for the time being, let the app server write logs, and then start Flume again? Will Flume only read the new logs and ignore the logs which have already been processed? Similarly, will Hive read the new logs which are not processed and ignore the ones it has already processed?


Solution

  • how should I move data from HDFS to HIVE table

    This isn't how Hive works. Hive is a metadata layer over existing HDFS storage. In Hive, you would define an EXTERNAL TABLE over wherever Flume writes your data.

    As data arrives, Hive "automatically knows" that there is new data to be queried, since it reads all files under the given path.
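
    For example, a minimal DDL sketch, assuming the comma-separated fields are a user ID, an IP address and a timestamp, and that Flume's HDFS sink writes to a path such as /user/flume/portal_logs (the table name, column names and location here are illustrative; adjust them to match your setup):

        -- External table over the directory the Flume HDFS sink writes into
        CREATE EXTERNAL TABLE IF NOT EXISTS portal_logins (
          user_id    STRING,
          ip_address STRING,
          login_ts   STRING
        )
        ROW FORMAT DELIMITED
        FIELDS TERMINATED BY ','
        STORED AS TEXTFILE
        LOCATION '/user/flume/portal_logs';

    Because the table is EXTERNAL, dropping it only removes the metadata; the files Flume wrote stay in HDFS.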


    what will happen if I stop the Flume agent for the time being, let the app server write logs, and then start Flume again? Will Flume only read the new logs and ignore the logs which have already been processed?

    Depends on how you've set up Flume. AFAIK, it will checkpoint all processed files and only pick up new ones.

    will Hive read the new logs which are not processed and ignore the ones it has already processed?

    Hive has no concept of unprocessed records. All files in the table location are read on every new query, limited only by your query conditions.
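
    To make that concrete, a query like the sketch below (using the illustrative portal_logins table from above) re-reads whatever files are currently under the table's location every time it runs; the WHERE clause filters rows, it does not stop Hive from scanning the files:

        -- Each run scans all files currently under the table's LOCATION;
        -- the predicate only filters the rows that are returned.
        SELECT user_id, ip_address, login_ts
        FROM portal_logins
        WHERE login_ts >= '2018-01-01 00:00:00';

    If the data grows large, partitioning the table (for example, by date) is the usual way to limit how much each query has to scan.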


    Bonus: Remove Flume and Sqoop. Make your app produce records into Kafka, and have Kafka Connect (or NiFi) write to both HDFS and your RDBMS from a single source (a Kafka topic). If you actually need to read log files, Filebeat or Fluentd uses fewer resources than Flume (or Logstash).

    Bonus 2: Remove HDFS & RDBMS and instead use a more real-time ingestion pipeline like Druid or Elasticsearch for analytics.

    Bonus 3: Presto / SparkSQL / Flink-SQL are faster than Hive (note: the Hive metastore is actually useful, so keep the RDBMS around for that)