I have a table in MySQL and I want to import its data using Sqoop. I imported the data and stored it in HDFS as a file. Now I want to run an incremental update on this file in HDFS.

Say I have 100 rows in the MySQL table and the file in HDFS contains the first 50 rows. How can I incrementally update this file?
I am talking about files, not Hive tables.

I want the incremental data as a separate file, not a merged file. For example, if the first part file contains 50 records, I need another part file that contains the next 50 records. In other words, can we do an incremental update on files?
You cannot update an HDFS file in place in this case, since HDFS files are write-once.
But this is a common use case, and the sqoop-merge tool can solve it. You need to perform a Sqoop incremental import and save its output as separate HDFS files, as in the sketch below.
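A minimal sketch of an append-mode incremental import, assuming an id column is used as the check column; the connection string, credentials, table name, and target directory are placeholders:

# import rows with id > 50 as new part files in a separate directory
sqoop import \
  --connect jdbc:mysql://dbhost/mydb \
  --username dbuser -P \
  --table mytable \
  --target-dir /data/mytable_increment \
  --incremental append \
  --check-column id \
  --last-value 50

This brings over only the rows whose id is greater than 50, so the new data stays in its own part files instead of touching the original import.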
As per the documentation:
The merge tool allows you to combine two datasets where entries in one dataset should overwrite entries of an older dataset. For example, an incremental import run in last-modified mode will generate multiple datasets in HDFS where successively newer data appears in each dataset. The merge tool will "flatten" two datasets into one, taking the newest available records for each primary key.
Sample Command:
sqoop merge --new-data newer --onto older --target-dir merged \
--jar-file datatypes.jar --class-name Foo --merge-key id
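Applied to the scenario above, the merge might look like the following; the jar file and class name come from the code Sqoop generated during the original import (or from sqoop codegen), and the paths are placeholders:

# flatten the original and incremental directories into one dataset keyed on id
sqoop merge \
  --new-data /data/mytable_increment \
  --onto /data/mytable \
  --target-dir /data/mytable_merged \
  --jar-file mytable.jar \
  --class-name mytable \
  --merge-key id

If you only want the new rows as separate part files, you can skip the merge step and simply keep the output directory of the incremental import.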