Search code examples
mysqlhadoopbigdatasqoop

sqoop merge-key creating multiple part files instead of one which doesn't serve the purpose of using merge-key


Ideally, when we run incremental without merge-key it will create new file with the appended data set but if we use merge-key then it will create new whole data set including the previous dataset in one file only. But I am not getting one part file when I use incremental append in my sqoop job. Below are my steps:

1) Initial data:

mysql> select * from departments_per;
+---------------+-----------------+
| department_id | department_name |
+---------------+-----------------+
|             2 | Fitness         |
|             3 | Footwear        |
|             4 | Apparel         |
|             5 | Golf            |
|             6 | Outdoors        |
|             7 | Fan Shop        |
+---------------+-----------------+  

2) sqoop command to import data into hdfs initially:

sqoop import \
--connect jdbc:mysql://localhost/practice \
--username root \
--password cloudera \
--table departments_per \
--target-dir /departments \
-m 1

Now when I see the directory departments under hdfs I can see one part file which is fine.

3) Now I update my initial data in mysql:

mysql> select * from departments_demo;
+---------------+-----------------+
| department_id | department_name |
+---------------+-----------------+
|             2 | Fitness         |
|             3 | Footwear        |
|             4 | Apparel         |
|             5 | Golf            |
|             6 | Outdoors        |
|             7 | Fan             |
|             8 | Tushar MC       |
+---------------+-----------------+

4) Now I create incremental append job and execute it:

sqoop job --create appendJobs12 \
-- import \
--connect jdbc:mysql://localhost/practice \
--username root \
--password cloudera \
--table departments_demo \
-m 1 \
--target-dir /departments \
--incremental append \
--merge-key department_id \
--check-column department_id \
--last-value 0 

sqoop job --exec appendJobs12   

5) I can see two part files in hdfs directory even though I used merge-key concept.

[cloudera@quickstart ~]$ hadoop fs -ls /departments
Found 3 items
-rw-r--r--   1 cloudera supergroup          0 2018-10-04 00:31 /departments/_SUCCESS
-rw-r--r--   1 cloudera supergroup         60 2018-10-04 00:31 /departments/part-m-00000
-rw-r--r--   1 cloudera cloudera           67 2018-10-04 00:37 /departments/part-m-00001  

When I display data it looks like below:

[cloudera@quickstart ~]$ hadoop fs -cat /departments/part-m-00000
2,Fitness
3,Footwear
4,Apparel
5,Golf
6,Outdoors
7,Fan Shop 
[cloudera@quickstart ~]$ hadoop fs -cat /departments/part-m-00001
    2,Fitness
    3,Footwear
    4,Apparel
    5,Golf
    6,Outdoors
    7,Fan
    8, Tushar MC

where one part file holds the initial data and the second one holds the updated data. Can anybody tell me where am I going wrong because of which I am not able to get one part file with the updated dataset. Thanks in advance


Solution

  • I researched this and got the same error with your approach. So, it is not correct - reading the various stuff I have to say I think it is not so clear. Anyway.

    I get the impression that external tables and some LINUX scripting is required with intermediate target directories.

    So, 1) I added the data in mysql and 2) did a sqoop import followed by 3) an update in mysql followed by 4) another incremental import as you did which is not really correct I think but you are wanting updates so OK, followed by 5) codegen and lastly 6) sqoop MERGE.

    These are the main steps:

    Initial import

    sqoop import -m 1 --connect jdbc:mysql://quickstart:3306/retail_db --username=retail_dba --password=cloudera --table ged  --warehouse-dir=/user/hive/warehouse --create-hive-table --hive-import --fields-terminated-by ',' --hive-drop-import-delims
    

    Incremental load as you did

    sqoop import -m 1 --connect jdbc:mysql://quickstart:3306/retail_db --username=retail_dba --password=cloudera --table ged  --warehouse-dir=/user/hive/warehouse --fields-terminated-by ',' --hive-drop-import-delims --last-value 0 --merge-key id --incremental append --check-column id
    

    Codegen

     sqoop codegen --connect jdbc:mysql://quickstart:3306/retail_db --username=retail_dba --password=cloudera --table ged
    

    Merge

     sqoop merge --new-data /user/hive/warehouse/ged --onto /user/hive/warehouse/new_ged --merge-key id --target-dir /user/hive/merged/ged2 --jar-file /tmp/sqoop-cloudera/compile/c8d374075351d228c50d89354959762e/ged.jar -class-name ged
    

    Results: [cloudera@quickstart ~]$ hadoop fs -cat /user/hive/merged/ged2/* 1,YYY 2,Peter 3,Bobby 4,Maria 5,Joke 6,Joker

    whereby I originally had 1, XXX but not 6,Joker

    This is somewhat different, so I am not sure what to state. In any event 1 file is not a valid assumption with lots of data. Your statement may work with extra parameters but this works as well.

    The clue here is updating of immutable systems via merge requiring different targets that can be switched via external table commands in terms of locations.