Ideally, when we run an incremental import without a merge key, it creates a new part file with just the appended data, but if we use a merge key, it should produce the whole dataset, including the previous data, in one file only. However, I am not getting one part file when I use incremental append in my Sqoop job. Below are my steps:
1) Initial data:
mysql> select * from departments_per;
+---------------+-----------------+
| department_id | department_name |
+---------------+-----------------+
|             2 | Fitness         |
|             3 | Footwear        |
|             4 | Apparel         |
|             5 | Golf            |
|             6 | Outdoors        |
|             7 | Fan Shop        |
+---------------+-----------------+
2) Sqoop command to import the data into HDFS initially:
sqoop import \
--connect jdbc:mysql://localhost/practice \
--username root \
--password cloudera \
--table departments_per \
--target-dir /departments \
-m 1
Now when I look at the /departments directory in HDFS, I can see one part file, which is fine.
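Just to show how I am checking at this point (same target directory as in the command above):
hadoop fs -ls /departments
hadoop fs -cat /departments/part-m-00000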
3) Now I update my initial data in MySQL:
mysql> select * from departments_demo;
+---------------+-----------------+
| department_id | department_name |
+---------------+-----------------+
|             2 | Fitness         |
|             3 | Footwear        |
|             4 | Apparel         |
|             5 | Golf            |
|             6 | Outdoors        |
|             7 | Fan             |
|             8 | Tushar MC       |
+---------------+-----------------+
4) Now I create an incremental append job and execute it:
sqoop job --create appendJobs12 \
-- import \
--connect jdbc:mysql://localhost/practice \
--username root \
--password cloudera \
--table departments_demo \
-m 1 \
--target-dir /departments \
--incremental append \
--merge-key department_id \
--check-column department_id \
--last-value 0
sqoop job --exec appendJobs12
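For completeness, the saved job definition can be inspected before running it; the sqoop job tool should support --list and --show for this:
sqoop job --list
sqoop job --show appendJobs12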
5) I can see two part files in the HDFS directory even though I used the merge-key option.
[cloudera@quickstart ~]$ hadoop fs -ls /departments
Found 3 items
-rw-r--r-- 1 cloudera supergroup 0 2018-10-04 00:31 /departments/_SUCCESS
-rw-r--r-- 1 cloudera supergroup 60 2018-10-04 00:31 /departments/part-m-00000
-rw-r--r-- 1 cloudera cloudera 67 2018-10-04 00:37 /departments/part-m-00001
When I display the data, it looks like this:
[cloudera@quickstart ~]$ hadoop fs -cat /departments/part-m-00000
2,Fitness
3,Footwear
4,Apparel
5,Golf
6,Outdoors
7,Fan Shop
[cloudera@quickstart ~]$ hadoop fs -cat /departments/part-m-00001
2,Fitness
3,Footwear
4,Apparel
5,Golf
6,Outdoors
7,Fan
8, Tushar MC
Here one part file holds the initial data and the second one holds the updated data. Can anybody tell me where I am going wrong, because of which I am not able to get one part file with the updated dataset? Thanks in advance.
I researched this and got the same issue with your approach, so it is not correct; reading the various material out there, I have to say it is not made very clear. Anyway.
I get the impression that external tables and some Linux scripting are required, with intermediate target directories.
So: 1) I added the data in MySQL, 2) did a Sqoop import, 3) updated the data in MySQL, 4) ran another incremental import as you did (which I think is not really correct, but you are wanting updates, so OK), 5) ran codegen, and lastly 6) ran a Sqoop merge.
These are the main steps:
Initial import
sqoop import -m 1 \
--connect jdbc:mysql://quickstart:3306/retail_db \
--username=retail_dba \
--password=cloudera \
--table ged \
--warehouse-dir=/user/hive/warehouse \
--create-hive-table \
--hive-import \
--fields-terminated-by ',' \
--hive-drop-import-delims
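To sanity-check the initial import, I list the warehouse directory the import wrote to (path derived from the --warehouse-dir and --table values above):
hadoop fs -ls /user/hive/warehouse/ged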
Incremental load as you did
sqoop import -m 1 \
--connect jdbc:mysql://quickstart:3306/retail_db \
--username=retail_dba \
--password=cloudera \
--table ged \
--warehouse-dir=/user/hive/warehouse \
--fields-terminated-by ',' \
--hive-drop-import-delims \
--last-value 0 \
--merge-key id \
--incremental append \
--check-column id
Codegen
sqoop codegen \
--connect jdbc:mysql://quickstart:3306/retail_db \
--username=retail_dba \
--password=cloudera \
--table ged
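codegen reports the location of the generated jar, which is where the path in the merge command below comes from; on the quickstart VM it should end up under the per-user compile directory and can be located with something like:
ls -t /tmp/sqoop-cloudera/compile/*/ged.jar | head -1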
Merge
sqoop merge \
--new-data /user/hive/warehouse/ged \
--onto /user/hive/warehouse/new_ged \
--merge-key id \
--target-dir /user/hive/merged/ged2 \
--jar-file /tmp/sqoop-cloudera/compile/c8d374075351d228c50d89354959762e/ged.jar \
--class-name ged
Results:
[cloudera@quickstart ~]$ hadoop fs -cat /user/hive/merged/ged2/*
1,YYY
2,Peter
3,Bobby
4,Maria
5,Joke
6,Joker
whereby I originally had 1,XXX but not 6,Joker, i.e. the updated value for id 1 and the newly inserted row 6 both show up in the merged output.
This is a somewhat different approach, so I am not sure what to state beyond that. In any event, one file is not a valid assumption with lots of data. Your approach may work with extra parameters, but this works as well.
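If a single local file is really needed for inspection, the simplest option I know of is hadoop fs -getmerge, which concatenates the part files into one local file (the local path here is just an example):
hadoop fs -getmerge /user/hive/merged/ged2 /tmp/ged2_merged.csv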
The clue here is that updating immutable systems via merge requires different target directories, which you can switch between with external table commands that change the table's location.
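As a rough sketch of that last point, assuming the merged data keeps the simple id,name layout and that the table name and the ged3 directory below are just examples of a next merge target, a Hive external table can be pointed at the current merge output and then re-pointed later:
# external table over the current merge output (schema assumed: id, name)
hive -e "CREATE EXTERNAL TABLE ged_ext (id INT, name STRING) ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' LOCATION '/user/hive/merged/ged2'"
# after the next merge run, repoint the same table to the new target directory
hive -e "ALTER TABLE ged_ext SET LOCATION '/user/hive/merged/ged3'"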