I can select from the Teradata database, but I cannot drop a table using PySpark.
I have also used jaydebeapi to drop the table in the same Spark session, and that works (a rough sketch of that is included after the error below). I was hoping someone may have encountered the same issue.
drop_sql = """ (DROP TABLE <DB_NAME>.<TABLENAME>) """
conn = spark.read \
    .format("jdbc") \
    .option("driver", "com.teradata.jdbc.TeraDriver") \
    .option("url", "jdbc:teradata://<IP_ADDRESS>/DATABASE=<DB_NAME>,TMODE=ANSI,CHARSET=UTF8,TYPE=FASTLOAD,LOGMECH=LDAP") \
    .option("query", drop_sql) \
    .option("user", user) \
    .option("password", password) \
    .option("fetchsize", 10000) \
    .load()
ERROR:
Py4JJavaError: An error occurred while calling o265.load. : java.sql.SQLException: [Teradata Database] [TeraJDBC 17.20.00.15] [Error 3707] [SQLState 42000] Syntax error, expected something like a name or a Unicode delimited identifier or an 'UDFCALLNAME' keyword or a 'SELECT' keyword or '(' between '(' and the 'DROP' keyword.
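For reference, the jaydebeapi version that does work for me looks roughly like this (a sketch; the connection details are the same placeholders as above, and the jar path is hypothetical):

import jaydebeapi

# Connect through the Teradata JDBC driver; the jar path below is a placeholder.
conn = jaydebeapi.connect(
    "com.teradata.jdbc.TeraDriver",
    "jdbc:teradata://<IP_ADDRESS>/DATABASE=<DB_NAME>,TMODE=ANSI,CHARSET=UTF8,LOGMECH=LDAP",
    [user, password],
    jars="/path/to/terajdbc4.jar",
)
curs = conn.cursor()
try:
    curs.execute("DROP TABLE <DB_NAME>.<TABLENAME>")
finally:
    curs.close()
    conn.close()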
spark.read provides a higher-level language; it is not a Teradata driver for Python. The query option of spark.read.format('jdbc') can only contain SELECT statements, because the SELECT is wrapped by Spark code before it's sent to the underlying driver for execution. E.g.:
spark.read.format("jdbc") \
.option("url", jdbcUrl) \
.option("query", "SELECT c1, c2 FROM t1") \
.option("partitionColumn", "partiion_id") \
.option("lowerBound", "1") \
.option("upperBound", "300") \
.option("numPartitions", "3") \
.load()
This would translate into three queries like the following being executed in parallel against the underlying DB. Note the real ones would be slightly different (Spark actually generates half-open ranges, with the first partition also catching rows below the lower bound and NULLs); these are simplified for illustration:
SELECT t.* FROM (SELECT c1, c2 FROM t1 WHERE partition_id BETWEEN 1 AND 100) t
SELECT t.* FROM (SELECT c1, c2 FROM t1 WHERE partition_id BETWEEN 101 AND 200) t
SELECT t.* FROM (SELECT c1, c2 FROM t1 WHERE partition_id BETWEEN 201 AND 300) t
So in your case Teradata is unhappy because Spark is executing something along the lines of:
SELECT t.* FROM (DROP TABLE <DB_NAME>.<TABLENAME>) t
What you have is not a "PySpark JDBC connector to Teradata"; it's the Teradata JDBC driver.
To run Teradata-specific SQL on Teradata, you need to write Python code that uses a Teradata-specific driver. Here is an example:
import teradatasql

with teradatasql.connect(host="whomooz", user="guest", password="please") as con:
    with con.cursor() as cur:
        try:
            sRequest = "DROP TABLE <DB_NAME>.<TABLENAME>"
            print(sRequest)
            cur.execute(sRequest)
        except Exception as ex:
            # Ignore errors such as "table does not exist"; print only the first line of the message.
            print("Ignoring", str(ex).split("\n")[0])
If you want to run this code on a Databricks/Spark cluster, you'll have to add the corresponding driver library on the cluster in question, e.g. as a cluster library: the teradatasql Python package for the code above, or the Teradata JDBC jar for the spark.read path. Then run the code above on that cluster.
I assume you've already done this for the JDBC driver, given the error you're getting.
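For example, on Databricks a notebook-scoped install of the Python driver would look like this (a sketch; assumes you're running in a notebook where %pip is available, otherwise install it as a cluster library):

%pip install teradatasql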