hadoop hdfs sqoop hadoop2 cloudera-quickstart-vm

Sqoop --split-by error while importing despite of having primary key in table

MySQL table with dept_id as primary key

|dept_id | dept_name |  
| 2 | Fitness   
| 3 | Footwear  
| 4 | Apparel   
| 5 | Golf      
| 6 | Outdoors  
| 7 | Fan Shop

Sqoop Query

sqoop import \  
-m 2 \  
--connect "jdbc:mysql://quickstart.cloudera:3306/retail_db" \  
--username retail_dba \  
-P \  
--query 'select * from departments where dept_id < 6  AND $CONDITIONS' \      
--target-dir /user/cloudera/sqoop_import/departments;

Results with an error on console:

When importing query results in parallel, you must specify --split-by

---Question point!---
Even though the table has primary key & the splits can be equally distributed between 2 mappers then what is the need of --spit-by or -m 1 ??

Guide me for the same.
Thanks.

Solution

It is not primary key with --split-by usage. you are seeing error because of use of --query option. This option MUST be used with --split-by, --target-dir and $CONDITIONS in query.

free_form_query_imports documentations

When importing a free-form query, you must specify a destination directory with --target-dir.

If you want to import the results of a query in parallel, then each map task will need to execute a copy of the query, with results partitioned by bounding conditions inferred by Sqoop. Your query must include the token $CONDITIONS which each Sqoop process will replace with a unique condition expression. You must also select a splitting column with --split-by.

You can use --where option if you don't want to use --split-by and --query:

sqoop import \
  --connect "jdbc:mysql://quickstart.cloudera:3306/retail_db" \
  --username=retail_dba \
  -P \
  --table departments \
  --target-dir /user/cloudera/departments \
  -m 2 \
  --where "department_id < 6"

if you use --boundary-query option then you don't need --split-by, --query option:

sqoop import \
  --connect "jdbc:mysql://quickstart.cloudera:3306/retail_db" \
  --username=retail_dba \
  -P \
  --table departments \
  --target-dir /user/cloudera/departments \
  -m 2 \
  --boundary-query "select 2, 6 from departments limit 1"

selecting_the_data_to_import

By default sqoop will use query select min(<split-by>), max(<split-by>) from <table name> to find out boundaries for creating splits. In some cases this query is not the most optimal so you can specify any arbitrary query returning two numeric columns using --boundary-query argument.