Significance of $CONDITIONS in Sqoop


What is the significance of the $CONDITIONS placeholder in a Sqoop import command?

select col1, col2 from test_table where \$CONDITIONS

Solution

  • Sqoop performs highly efficient data transfers by inheriting Hadoop’s parallelism.

    • To let Sqoop split your query into multiple chunks that can be transferred in parallel, you need to include the $CONDITIONS placeholder in the WHERE clause of your query.

    • Sqoop will automatically substitute this placeholder with the generated conditions specifying which slice of data should be transferred by each individual task.

    • While you could skip $CONDITIONS by forcing Sqoop to run only a single map task using the --num-mappers 1 parameter, such a limitation would have a severe performance impact.

    For example:

    If you run a parallel import of "select bla from foo WHERE $CONDITIONS", the map tasks will execute your query with different values substituted for $CONDITIONS. One mapper may execute "select bla from foo WHERE (id >= 0 AND id < 10000)", the next mapper may execute "select bla from foo WHERE (id >= 10000 AND id < 20000)", and so on.
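
    The substitution described above can be sketched in a few lines of Python. This is only an illustration of the idea, not Sqoop's actual implementation: the helper names split_conditions and substitute are hypothetical, the split column is assumed to be an integer id, and the range is assumed to run from 0 to 19999. Sqoop's real boundary handling may differ in detail.

```python
def split_conditions(column, lo, hi, num_mappers):
    """Return one WHERE-clause fragment per mapper, together covering lo..hi.

    Hypothetical helper: a simplified model of range splitting on an
    integer column, not Sqoop's own splitter.
    """
    span = hi - lo + 1
    step = -(-span // num_mappers)  # ceiling division
    conditions = []
    start = lo
    while start <= hi:
        end = min(start + step, hi + 1)  # exclusive upper bound
        conditions.append(f"({column} >= {start} AND {column} < {end})")
        start = end
    return conditions


def substitute(query, condition):
    """Replace the $CONDITIONS placeholder with one mapper's condition."""
    return query.replace("$CONDITIONS", condition)


query = "select bla from foo WHERE $CONDITIONS"

# Each printed line is the query one mapper would run in this model.
for cond in split_conditions("id", 0, 19999, 2):
    print(substitute(query, cond))
```

    With two mappers over ids 0 to 19999, this prints the same two queries as in the example above: one for (id >= 0 AND id < 10000) and one for (id >= 10000 AND id < 20000).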