Tags: sql, pyspark, jdbc, amazon-redshift

Pass parameters to a query with spark.read.format("jdbc")


I am trying to execute the following sample query through spark.read.format("jdbc") in Redshift:

query = "select * from tableA a join tableB b on a.id = b.id where a.date > ? and b.date > ?"

I am trying to execute this query as follows:

I read the query parameters from a table into a DataFrame, collect them into a list with df.collect(), and convert that list to a tuple to pass as parameters.
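
A minimal sketch of that collection step might look like this (the parameter table and column names are assumptions for illustration, not from the original post):

# hypothetical parameter table holding one row with the two date values
rows = spark.table("query_param_table").collect()
# build the parameter tuple from the collected Row
query_parameters = tuple(str(rows[0][c]) for c in ("start_date", "end_date"))

which in this case yields: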

query_parameters = ("2022-01-01","2023-01-01")

df = spark.read.format("jdbc") \
  .option("url", url) \
  .option("query",query) \
  .option("user","user") \
  .option("password", pass) \
  .option("queryParameters", ",".join(query_parameters)) \
  .option("driver", driver) \
  .load()

I am getting the following error:

com.amazon.redshift.util.RedshiftException: ERROR: invalid input syntax for type date: "?"

How do we pass parameters to a query when executing it through spark.read in PySpark?


Solution

  • According to the official spark-redshift implementation, there is no option named queryParameters; I wasn't able to find it anywhere in the official GitHub code. Because no such option exists, the ? placeholders are sent to Redshift unsubstituted, which is why it rejects the literal "?" as an invalid date.

    The only way to pass parameters to your query is through Python string concatenation or interpolation, then setting the query option of the data source, as shown in the corresponding test suite RedshiftReadSuite.scala.

    For the above example this should work:

    query_parameters = ("2022-01-01","2023-01-01")
    
    query = f"select * from tableA a join tableB b on a.id = b.id where a.date > '{query_parameters[0]}' and b.date > '{query_parameters[1]}'"
    
    df = spark.read.format("jdbc") \
      .option("url", url) \
      .option("query",query) \
      .option("user","user") \
      .option("password", pass) \
      .option("driver", driver) \
      .load()
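
    Note that building SQL through string interpolation is vulnerable to SQL injection if the parameter values come from an untrusted source. One way to guard against that is to validate the values before interpolating them; a minimal sketch, assuming the parameters are ISO dates (the validate_date helper below is hypothetical, not part of Spark):

    from datetime import date

    def validate_date(value: str) -> str:
        # Raises ValueError unless value is a valid YYYY-MM-DD date,
        # so arbitrary SQL fragments cannot sneak into the query.
        date.fromisoformat(value)
        return value

    safe_parameters = tuple(validate_date(v) for v in query_parameters)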