Tags: hadoop, hive, pipeline, sqoop, oozie

How to create a data pipeline from a Hive table to a relational database


Background:

I have a Hive table "log" that contains log information. New log data is loaded into this table every hour. I want to run some quick analytics on the logs for the past two days, so I want to extract the last 48 hours of data into my relational database.

To solve this, I created a staging Hive table that is loaded by a Hive SQL query. After the new data lands in the staging table, I load the new logs into the relational database using a Sqoop export, roughly as sketched below.
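
For concreteness, the staging step currently looks something like this (a sketch only: the staging table log_staging and the timestamp column log_ts are placeholders for my actual schema):

    # Rebuild the staging table with the last 48 hours of logs;
    # log_ts is assumed to be a timestamp string that unix_timestamp() can parse.
    hive -e "
      INSERT OVERWRITE TABLE log_staging
      SELECT *
      FROM log
      WHERE unix_timestamp(log_ts) >= unix_timestamp() - 48 * 60 * 60;
    "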

The problem is that Sqoop loads data into the relational database in batches, so at any given time the database contains only a partial set of logs for a particular hour.

This leads to erroneous analytics output.

Questions:

1) How can I make this Sqoop data load transactional, i.e. either all records are exported or none are?

2) What is the best way to build this data pipeline, so that the whole Hive table -> staging table -> relational table process is automated?

Technical Details:

Hadoop - 1.0.4
Hive - 0.9.0
Sqoop - 1.4.2


Solution

  • You should be able to do this with Sqoop's --staging-table option. The staging table acts as an auxiliary table used to stage exported data: Sqoop first writes the exported rows there, then moves them to the destination table in a single transaction. This way you shouldn't see consistency issues from partial data.

    (source: Sqoop documentation)
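
    A minimal sketch of such an export (the connection string, credentials, table names, and paths are placeholders; the staging table must already exist in the database with the same schema as the target table):

        sqoop export \
          --connect jdbc:mysql://db.example.com/analytics \
          --username dbuser -P \
          --table recent_logs \
          --staging-table recent_logs_staging \
          --clear-staging-table \
          --export-dir /user/hive/warehouse/log_staging \
          --input-fields-terminated-by '\0001'  # Hive's default field delimiter (Ctrl-A)

    Per the Sqoop documentation, the staging table must be empty before the export begins (or be cleared with --clear-staging-table), and staging is not supported with --direct exports or together with --update-key.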