MySQL table with dept_id as primary key
|dept_id | dept_name |
| 2 | Fitness
| 3 | Footwear
| 4 | Apparel
| 5 | Golf
| 6 | Outdoors
| 7 | Fan Shop
Sqoop Query
sqoop import \
-m 2 \
--connect "jdbc:mysql://quickstart.cloudera:3306/retail_db" \
--username retail_dba \
-P \
--query 'select * from departments where dept_id < 6 AND $CONDITIONS' \
--target-dir /user/cloudera/sqoop_import/departments;
Results with an error on console:
When importing query results in parallel, you must specify
--split-by
---Question point!---
Even though the table has primary key & the splits can be equally distributed between 2 mappers then what is the need of --spit-by or -m 1 ??
Guide me for the same.
Thanks.
It is not primary key with --split-by
usage. you are seeing error because of use of --query
option. This option MUST be used with --split-by
, --target-dir
and $CONDITIONS
in query.
free_form_query_imports documentations
When importing a free-form query, you must specify a destination directory with --target-dir.
If you want to import the results of a query in parallel, then each map task will need to execute a copy of the query, with results partitioned by bounding conditions inferred by Sqoop. Your query must include the token $CONDITIONS which each Sqoop process will replace with a unique condition expression. You must also select a splitting column with --split-by.
You can use --where
option if you don't want to use --split-by
and --query
:
sqoop import \
--connect "jdbc:mysql://quickstart.cloudera:3306/retail_db" \
--username=retail_dba \
-P \
--table departments \
--target-dir /user/cloudera/departments \
-m 2 \
--where "department_id < 6"
if you use --boundary-query
option then you don't need --split-by
, --query
option:
sqoop import \
--connect "jdbc:mysql://quickstart.cloudera:3306/retail_db" \
--username=retail_dba \
-P \
--table departments \
--target-dir /user/cloudera/departments \
-m 2 \
--boundary-query "select 2, 6 from departments limit 1"
By default sqoop will use query
select min(<split-by>), max(<split-by>) from <table name>
to find out boundaries for creating splits. In some cases this query is not the most optimal so you can specify any arbitrary query returning two numeric columns using --boundary-query argument.