Search code examples
hadoopmergehdfssqoop

Is it possible to write a Sqoop incremental import with filters on the new file before importing?


My doubt is, Say, I have a file A1.csv with 2000 records on sql-server table, I import this data into hdfs, later that day I have added 3000 records to the same file on sql-server table. Now, I want to run incremental import for the second chunk of data to be added on hdfs, but, I do not want complete 3000 records to be imported. I need only some data according to my necessity to be imported, like, 1000 records with certain condition to be imported as part of the increment import.

Is there a way to do that using sqoop incremental import command?

Please Help, Thank you.


Solution

  • You need a unique key or a Timestamp field to identify the deltas which is the new 1000 records in your case. using that field you have to options to bring in the data to Hadoop.


    Option 1

    is to use the sqoop incremental append, below is the example of it

    sqoop import \
    --connect jdbc:oracle:thin:@enkx3-scan:1521:dbm2 \
    --username wzhou \
    --password wzhou \
    --table STUDENT \
    --incremental append \
    --check-column student_id \
    -m 4 \
    --split-by major
    

    Arguments :

    --check-column (col)  #Specifies the column to be examined when determining which rows to import.
    
    --incremental (mode)      #Specifies how Sqoop determines which rows are new. Legal values for mode include append and lastmodified.
    
    --last-value (value) Specifies the maximum value of the check column from the previous import.
    

    Option 2

    Using the --query argument in sqoop where you can use the native sql for mysql/any database you connect to.

    Example :

    sqoop import \
      --query 'SELECT a.*, b.* FROM a JOIN b on (a.id == b.id) WHERE $CONDITIONS' \
      --split-by a.id --target-dir /user/foo/joinresults
    
    sqoop import \
      --query 'SELECT a.*, b.* FROM a JOIN b on (a.id == b.id) WHERE $CONDITIONS' \
      -m 1 --target-dir /user/foo/joinresults