I am trying to execute the following sample query through spark.read.format("jdbc") against Redshift:
query = "select * from tableA a join tableB b on a.id = b.id where a.date > ? and b.date > ?"
I am trying to execute this query as follows. I read the query parameters from a table into a DataFrame, collect them with df.collect(), and convert the result to a tuple so I can pass them as parameters:
query_parameters = ("2022-01-01","2023-01-01")
df = spark.read.format("jdbc") \
    .option("url", url) \
    .option("query", query) \
    .option("user", "user") \
    .option("password", password) \
    .option("queryParameters", ",".join(query_parameters)) \
    .option("driver", driver) \
    .load()
I am getting the following error:
com.amazon.redshift.util.RedshiftException: ERROR: invalid input syntax for type date: "?"
How do we pass parameters to the query when executing through spark.read in PySpark?
According to the official spark-redshift implementation, there is no option named queryParameters. I am not sure where you found it, but I wasn't able to find it in the official GitHub code.
The only way to pass parameters to your query is through Python string concatenation or interpolation, setting the query option of the data source accordingly, as shown in the corresponding test suite RedshiftReadSuite.scala.
For the above example, this should work:
query_parameters = ("2022-01-01", "2023-01-01")
query = f"select * from tableA a join tableB b on a.id = b.id where a.date > '{query_parameters[0]}' and b.date > '{query_parameters[1]}'"
df = spark.read.format("jdbc") \
    .option("url", url) \
    .option("query", query) \
    .option("user", "user") \
    .option("password", password) \
    .option("driver", driver) \
    .load()
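Note that string interpolation does not escape its inputs the way a bound parameter would, so if the values come from an untrusted source you may want to validate them before building the query. A minimal sketch, assuming the parameters are ISO-8601 date strings:

from datetime import date

# Raises ValueError if a value is not a valid ISO date
# (e.g. if it contains a quote or other injected SQL)
validated = tuple(date.fromisoformat(p).isoformat() for p in query_parameters)

query = (
    "select * from tableA a join tableB b on a.id = b.id "
    f"where a.date > '{validated[0]}' and b.date > '{validated[1]}'"
)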