hadoop | hive | sqoop

Incremental Sqoop to Hive table


It is known that the --incremental switch of Sqoop import doesn't work for a Hive import through Sqoop. But what is the workaround for that?

1) One thing I could come up with is to create a Hive table, bring the incremental data into HDFS through Sqoop, and then load it manually. But if we do that, the data would be overwritten on each load. Please correct me if I am wrong.

2) How effective is --query when sqooping data into Hive?

Thank you


Solution

  • You can do a Sqoop incremental append into a Hive table. There is no direct option for it, but below is one way you can achieve it.

    Store the incremental data as an external table in Hive. The common pattern is to import only the changes made since the last update and then merge them into the main table. In the following example, --check-column is used to fetch records newer than last_import_date, the date of the last incremental data load:

    sqoop import \
        --connect jdbc:teradata://{host name}/Database=retail \
        --connection-manager org.apache.sqoop.teradata.TeradataConnManager \
        --username dbc --password dbc \
        --table SOURCE_TBL \
        --target-dir /user/hive/incremental_table \
        -m 1 \
        --check-column modified_date \
        --incremental lastmodified \
        --last-value {last_import_date}
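
    To make those imported files queryable, you can lay a Hive external table over the Sqoop target directory. The sketch below is only an illustration: the table name incremental_table and its columns are hypothetical placeholders and would need to match the actual schema of SOURCE_TBL.

    -- Hypothetical external table over the Sqoop target directory;
    -- column names and types are placeholders, not the real SOURCE_TBL schema
    CREATE EXTERNAL TABLE incremental_table (
        id            INT,
        name          STRING,
        modified_date TIMESTAMP
    )
    ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
    LOCATION '/user/hive/incremental_table';

    Because the table is external, files appended by later incremental runs (e.g., when using --append) show up in it automatically instead of overwriting it, and merging the changes into your main table can then be done with a Hive INSERT ... SELECT.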
    

    For the second part of your question:

    --query is also a very useful argument you can leverage in a Sqoop import; it gives you the flexibility of basic joins on the RDBMS tables and lets you rework date and time formats. If I were in your shoes, I would import the data with --query exactly the way I need it, land it in a temporary table, and then append it to my original table; while loading from the temporary table into the main table, I can work the data further. I would suggest using --query if the updates are not too frequent.
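
    As a concrete sketch of that flow: the joined table DIM_TBL, the staging path /user/hive/staging_table, and the Hive tables staging_table and main_table below are all hypothetical names used for illustration; $CONDITIONS, however, is a real token that Sqoop requires in every free-form --query so it can split the work across mappers.

    # Hypothetical free-form query import into a staging directory;
    # with -m 1 no --split-by is needed
    sqoop import \
        --connect jdbc:teradata://{host name}/Database=retail \
        --username dbc --password dbc \
        --query 'SELECT s.id, s.name, s.modified_date, d.region
                 FROM SOURCE_TBL s JOIN DIM_TBL d ON (s.dim_id = d.id)
                 WHERE s.modified_date > {last_import_date} AND $CONDITIONS' \
        --target-dir /user/hive/staging_table \
        -m 1

    Loading from the staging table into the main table is then the place to reshape dates and columns:

    -- Append staged rows into the main table, adjusting formats on the way;
    -- table and column names are placeholders
    INSERT INTO TABLE main_table
    SELECT id, name, CAST(modified_date AS TIMESTAMP), region
    FROM staging_table;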