Observing duplicates using sqoop with Oozie

I've built a sqoop pogram in order to import data from MySQL to HDFS using a pre-built sqoop job:

                sqoop job -fs $driver_path -D mapreduce.map.java.opts=" -Duser.timezone=Europe/Paris"\
            --create job_parquet_table -- import -m $nodes_number\
            --connect jdbc:mysql://$server:$port/$database --username $username --password-file $pass_file\
            --target-dir $destination_dir --table $table --as-parquetfile --append\
            --incremental append --check-column $id_column_names --last-value 1\
            --fields-terminated-by , --escaped-by \\ --enclosed-by '\"'

I've also builted a spark progam which permits to monnitore the sqoop importation. In this, I have measured the max_id (SELECT Max(ID) of the table which is being imported, the sqoop incremental.last.value, the number of records ( SELECT count()), the number of distinct records ( SELECT count(distinct ID)) and the percent of duplicates (defined as SELECT count() - count(distinct ID) / count(*) * 100).

When I import the data manually, by executing the sqoop job:

sqoop-job --exec job_parquet_$table -- --username edastatnew --password edastatnew

Everything is right, I've got no duplicates and data is perfectly imported

However, when I use Oozie to schedule the sqoop job, I have observed that sqoopincremental.last.value is wrongly updated, what causes duplicates in my data. Here is my monitoring:

As you can see, the max_ID doesn't match the sqoop_last_value at different time of the experiment. I have concluded that when using Oozie, the sqoop incremenatal.last.value is wrongly updated after executing a sqoop job. Have someone faced this same issue? if yes, have you find any solution?

Thanks,

Solution

Ask yourself a question: where does Sqoop store that "last value" information?

The answer is: for Sqoop1, by default, in a file on the local filesystem. But Oozie runs your Sqoop job on random machines therefore the executions are not coordinated.
And Sqoop2 (which has a proper Metastore database) is more or less in limbo; at least it is not supported by Oozie.

The solution is to start a shared HSQLDB database service to store the "last value" information for all Sqoop1 jobs, whatever machine they are running on.

Please read the Sqoop1 documentation about its lame Metastore and about how to use it, from there to there.
And for a more professional handling of that obsolete HSQLDB database, look at that post of mine.