I want to read filtered data from a MySQL instance in an AWS Glue job. Since a Glue JDBC connection doesn't allow me to push down a predicate, I am trying to explicitly create a JDBC connection in my code.
I want to run a SELECT query with a WHERE clause against the MySQL database using that JDBC connection, as shown below:
import com.amazonaws.services.glue.GlueContext
import org.apache.spark.SparkContext
import org.apache.spark.sql.SparkSession

object TryMe {
  def main(args: Array[String]): Unit = {
    val sc: SparkContext = new SparkContext()
    val glueContext: GlueContext = new GlueContext(sc)
    val spark: SparkSession = glueContext.getSparkSession

    // Read filtered data from MySQL over JDBC into a DataFrame
    val t = glueContext.read.format("jdbc")
      .option("url", "jdbc:mysql://serverIP:port/database")
      .option("user", "username")
      .option("password", "password")
      .option("dbtable", "select * from table1 where 1=1")
      .option("driver", "com.mysql.jdbc.Driver")
      .load()
  }
}
It fails with the following error:

com.mysql.jdbc.exceptions.jdbc4.MySQLSyntaxErrorException: You have an error in your SQL syntax; check the manual that corresponds to your MySQL server version for the right syntax to use near 'select * from table1 where 1=1 WHERE 1=0' at line 1
Shouldn't this work? How do I retrieve filtered data through a JDBC connection without reading the whole table into a DataFrame?
I think the problem occurred because you didn't wrap the query in parentheses and give it an alias. Spark substitutes the value of the dbtable option directly into SQL it generates itself (for example, it probes the schema with SELECT * FROM <dbtable> WHERE 1=0), which is why the error message shows your SELECT followed by WHERE 1=0. In my opinion it should look like the following example:
val t = glueContext.read.format("jdbc")
  .option("url", "jdbc:mysql://serverIP:port/database")
  .option("user", "username")
  .option("password", "password")
  .option("dbtable", "(select * from table1 where 1=1) as t1")
  .option("driver", "com.mysql.jdbc.Driver")
  .load()
More information about the parameters of the JDBC data source:
https://spark.apache.org/docs/latest/sql-data-sources-jdbc.html
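As a side note, if your Glue job runs Spark 2.4 or later, the JDBC data source also accepts a query option that takes the SQL statement directly, so the subquery-and-alias workaround is not needed (query and dbtable are mutually exclusive). A minimal sketch, reusing the placeholder connection details from the question:

// Spark 2.4+: pass the SQL directly via the "query" option instead of
// wrapping it in a parenthesized subquery for "dbtable".
val t = glueContext.read.format("jdbc")
  .option("url", "jdbc:mysql://serverIP:port/database")
  .option("user", "username")
  .option("password", "password")
  .option("query", "select * from table1 where 1=1")
  .option("driver", "com.mysql.jdbc.Driver")
  .load()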
When it comes to Glue and the framework it provides, there is also the push_down_predicate option, but I have only used it with data sources based on S3. I don't think it works with sources other than S3, or with non-partitioned data. For completeness, there is a sketch of it after the link below.
https://docs.aws.amazon.com/glue/latest/dg/aws-glue-programming-etl-partitions.html
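This is roughly how a push-down predicate is used against a partitioned Data Catalog table backed by S3. The database, table, and partition column names here are placeholders, not anything from the question:

import com.amazonaws.services.glue.GlueContext
import org.apache.spark.SparkContext

object CatalogPredicateExample {
  def main(args: Array[String]): Unit = {
    val glueContext = new GlueContext(new SparkContext())

    // Only partitions matching the predicate are listed and read from S3.
    // "year" and "month" are assumed partition keys of the catalog table.
    val frame = glueContext.getCatalogSource(
      database = "my_database",            // placeholder Data Catalog database
      tableName = "my_partitioned_table",  // placeholder catalog table
      pushDownPredicate = "year == '2019' and month == '06'"
    ).getDynamicFrame()
  }
}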