Tags: hadoop, hive, hdfs, sqoop

sqoop import multiple tables


We are using Cloudera CDH 4, and we can import tables from our Oracle databases into our HDFS warehouse as expected. The problem is that we have tens of thousands of tables in our databases, and Sqoop only supports importing one table at a time.

What options are available for importing multiple tables into HDFS or Hive? For example, what would be the best way to import 200 tables from Oracle into HDFS or Hive at once?

The only solution I have seen so far is to create a Sqoop job for each table import and then run them all individually. Since Hadoop is designed to work with large datasets, it seems like there should be a better way, though.


Solution

    1. Assuming the Sqoop configuration is the same for every table, you can list all the tables you need to import and then iterate over that list, launching one Sqoop job per table (ideally asynchronously). You can fetch the list of tables from Oracle with: SELECT owner, table_name FROM dba_tables (see the script sketch after this list).

    2. Sqoop does offer an import-all-tables tool; see the import-all-tables section of the Sqoop user guide. There are some limitations, though: each table must have a single-column primary key, all columns of every table are imported, and you cannot restrict rows with a WHERE clause. A usage sketch follows the list.

    3. Modify the Sqoop source code and recompile it to suit your needs. The Sqoop codebase is well documented and nicely organized.
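
For option 1, here is a minimal shell sketch of the iterate-and-launch approach. The JDBC URL, credentials, paths, and the tables.txt file are placeholders for illustration; the sqoop import flags themselves are standard:

    #!/usr/bin/env bash
    # Hypothetical connection details -- adjust for your environment.
    # Note: passing --password on the command line is insecure; prefer -P
    # (interactive prompt) or an options file in production.
    JDBC_URL="jdbc:oracle:thin:@//dbhost:1521/ORCL"

    # tables.txt holds one table name per line, e.g. generated from:
    #   SELECT table_name FROM dba_tables WHERE owner = 'MYSCHEMA';
    while read -r table; do
        sqoop import \
            --connect "$JDBC_URL" \
            --username myuser \
            --password mypassword \
            --table "$table" \
            --target-dir "/warehouse/$table" \
            -m 4 &    # launch asynchronously, as suggested above

        # Throttle: wait whenever 5 imports are already running.
        while [ "$(jobs -rp | wc -l)" -ge 5 ]; do sleep 10; done
    done < tables.txt
    wait    # block until the last import finishes

The throttle keeps the script from flooding the cluster and the Oracle server with hundreds of concurrent jobs; tune the limit and the number of mappers (-m) to what your database can handle.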
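For option 2, a minimal sketch of the built-in tool, again with placeholder connection details. The --exclude-tables flag (available in newer Sqoop releases, so check the version bundled with your CDH) lets you skip the few tables that violate the constraints above:

    sqoop import-all-tables \
        --connect jdbc:oracle:thin:@//dbhost:1521/ORCL \
        --username myuser \
        --password mypassword \
        --warehouse-dir /user/hive/warehouse \
        --exclude-tables BAD_TABLE1,BAD_TABLE2 \
        -m 4

Adding --hive-import would load the tables into Hive rather than leaving them as plain files under the warehouse directory.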