Search code examples
sqoop

Valid MySQL query breaks when used as boundary-query


Note: This is NOT a duplicate of Sqoop - Syntaxt error - Boundary Query - “error in your SQL syntax”


To limit the fetching data from only last 8 days, I'm using this following boundary-query with Sqoop

SELECT min(`created_at`),
       max(`created_at`)
FROM `billing_db`.`billing_ledger`
WHERE `created_at` >= timestamp(date(convert_tz(now(), IF(@@global.time_zone = 'SYSTEM', @@system_time_zone, @@global.time_zone),'Asia/Kolkata')) + interval -2 DAY)"

I've broken query into multiple lines here for readability, actually i pass it to Sqoop in single line only

Explaination of different parts of boundary-query are

  1. IF(@@global.time_zone = 'SYSTEM', @@system_time_zone, @@global.time_zone)
    • determines server timezone
    • works for both MySQL & TiDB
  2. convert_tz(now(), <server-timezone>,'Asia/Kolkata')
    • converts time from server-timezone in IST
  3. timestamp(date(<ist-timestamp> + interval -{num_days} DAY)
    • returns the IST timestamp at 00:00 hours for date whih is {num_days} before today (current-time -> tz-specific)

While the query works fine on MySQL

mysql> SELECT min(`created_at`),
    ->        max(`created_at`)
    -> FROM `billing_db`.`billing_ledger`
    -> WHERE `created_at` >= timestamp(date(convert_tz(now(), IF(@@global.time_zone = 'SYSTEM', @@system_time_zone, @@global.time_zone),'Asia/Kolkata')) + interval -2 DAY);
+---------------------+---------------------+
| min(`created_at`)   | max(`created_at`)   |
+---------------------+---------------------+
| 2020-05-08 00:00:00 | 2020-05-10 20:12:32 |
+---------------------+---------------------+
1 row in set (0.02 sec)

It breaks with following stacktrace on Sqoop

INFO db.DataDrivenDBInputFormat: BoundingValsQuery: SELECT min(), max() FROM . WHERE  >= timestamp(date(convert_tz(now(), IF(@@global.time_zone = 'SYSTEM', @@system_time_zone, @@global.time_zone),'Asia/Kolkata')) + interval -2 DAY)
[2020-05-10 12:45:34,968] {ssh_utils.py:130} WARNING - 20/05/10 18:15:34 INFO mapreduce.JobSubmitter: Cleaning up the staging area /tmp/hadoop-yarn/staging/hadoop/.staging/job_1589114450995_0001
[2020-05-10 12:45:34,971] {ssh_utils.py:130} WARNING - 20/05/10 18:15:34 DEBUG util.ClassLoaderStack: Restoring classloader: sun.misc.Launcher$AppClassLoader@6ab7a896
[2020-05-10 12:45:34,973] {ssh_utils.py:130} WARNING - 20/05/10 18:15:34 ERROR tool.ImportTool: Import failed: java.io.IOException: java.sql.SQLSyntaxErrorException: (conn=313686) You have an error in your SQL syntax; check the manual that corresponds to your TiDB version for the right syntax to use line 1 column 12 near "), max() FROM . WHERE  >= timestamp(date(convert_tz(now(), IF(@@global.time_zone = 'SYSTEM', @@system_time_zone, @@global.time_zone),'Asia/Kolkata')) + interval -2 DAY)" 
    at org.apache.sqoop.mapreduce.db.DataDrivenDBInputFormat.getSplits(DataDrivenDBInputFormat.java:207)
    at org.apache.hadoop.mapreduce.JobSubmitter.writeNewSplits(JobSubmitter.java:303)

For the record

  • using WHERE $CONDITIONS is required in --query (free-form query-import) but for --boundary-query it is NOT mandatory. Without it, Sqoop merely generates this warning

    WARN db.DataDrivenDBInputFormat: Could not find $CONDITIONS token in query: SELECT min(), max() FROM . WHERE >= timestamp(date(convert_tz(now(), IF(@@global.time_zone = 'SYSTEM', @@system_time_zone, @@global.time_zone),'Asia/Kolkata')) + interval -2 DAY); splits may not partition data.

  • I've been using similar complex boundary-querys elsewhere in my pipeline but in this particular case it is breaking


What have I tried

  • I tried adding aliases in SELECT clause of query like this

    SELECT min(`created_at`) AS min_created_at,...


Solution

  • Backticks `` were the culprit

    Removing backticks from boundary-query resolved the error

    • Some comments in discussions point out that backticks can cause wierd things with sqoop
    • But the docs bear no mention of it and some discussions even encourage using it