I want to read filtered data from a MySQL instance in an AWS Glue job. Since a Glue JDBC connection doesn't allow me to push down a predicate, I am trying to explicitly create a JDBC connection in my code.
I want to run a SELECT query with a WHERE clause against the MySQL database using that JDBC connection, as shown below:
import com.amazonaws.services.glue.GlueContext
import org.apache.spark.SparkContext
import org.apache.spark.sql.SparkSession

object TryMe {
  def main(args: Array[String]): Unit = {
    val sc: SparkContext = new SparkContext()
    val glueContext: GlueContext = new GlueContext(sc)
    val spark: SparkSession = glueContext.getSparkSession

    // Read filtered data from MySQL over JDBC into a DataFrame
    val t = glueContext.read.format("jdbc")
      .option("url", "jdbc:mysql://serverIP:port/database")
      .option("user", "username")
      .option("password", "password")
      .option("dbtable", "select * from table1 where 1=1")
      .option("driver", "com.mysql.jdbc.Driver")
      .load()
  }
}
It fails with the following error:

com.mysql.jdbc.exceptions.jdbc4.MySQLSyntaxErrorException: You have an error in your SQL syntax; check the manual that corresponds to your MySQL server version for the right syntax to use near 'select * from table1 where 1=1 WHERE 1=0' at line 1
Shouldn't this work? How do I retrieve filtered data through a JDBC connection without reading the whole table into a DataFrame?
I think the problem occurred because you didn't wrap the query in parentheses and give it an alias. Spark substitutes the value of the dbtable option directly into SQL it generates itself (for example, it probes the schema with SELECT * FROM <dbtable> WHERE 1=0), which is why the error message shows your SELECT followed by WHERE 1=0. In my opinion it should look like the following example:
val t = glueContext.read.format("jdbc")
  .option("url", "jdbc:mysql://serverIP:port/database")
  .option("user", "username")
  .option("password", "password")
  .option("dbtable", "(select * from table1 where 1=1) as t1")
  .option("driver", "com.mysql.jdbc.Driver")
  .load()
More information about the parameters of the JDBC data source:
https://spark.apache.org/docs/latest/sql-data-sources-jdbc.html
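As a side note, if your Glue job runs Spark 2.4 or later, the JDBC data source also accepts a query option that takes the SQL statement directly, so the subquery-and-alias workaround is not needed (query and dbtable are mutually exclusive). A minimal sketch, reusing the placeholder connection details from the question:

// Spark 2.4+: pass the SQL directly via the "query" option instead of
// wrapping it in a parenthesized subquery for "dbtable".
val t = glueContext.read.format("jdbc")
  .option("url", "jdbc:mysql://serverIP:port/database")
  .option("user", "username")
  .option("password", "password")
  .option("query", "select * from table1 where 1=1")
  .option("driver", "com.mysql.jdbc.Driver")
  .load()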
When it comes to Glue and the framework it provides, there is also the push_down_predicate option, but I have only used it with data sources based on S3. I don't think it works with sources other than S3, or with non-partitioned data. For completeness, there is a sketch of it after the link below.
https://docs.aws.amazon.com/glue/latest/dg/aws-glue-programming-etl-partitions.html
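This is roughly how a push-down predicate is used against a partitioned Data Catalog table backed by S3. The database, table, and partition column names here are placeholders, not anything from the question:

import com.amazonaws.services.glue.GlueContext
import org.apache.spark.SparkContext

object CatalogPredicateExample {
  def main(args: Array[String]): Unit = {
    val glueContext = new GlueContext(new SparkContext())

    // Only partitions matching the predicate are listed and read from S3.
    // "year" and "month" are assumed partition keys of the catalog table.
    val frame = glueContext.getCatalogSource(
      database = "my_database",            // placeholder Data Catalog database
      tableName = "my_partitioned_table",  // placeholder catalog table
      pushDownPredicate = "year == '2019' and month == '06'"
    ).getDynamicFrame()
  }
}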