pyspark, jdbc, teradata

How to drop a table using the pyspark JDBC connector to Teradata?


I can select from the Teradata database, but I cannot drop a table using pyspark.

I have also used jaydebeapi to drop the table in the same Spark session, and that works. I was hoping someone may have encountered the same issue.
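For reference, the jaydebeapi call that does work looks roughly like this (host, credentials, and the driver jar path are placeholders):

import jaydebeapi

# Works: the DROP is executed through the Teradata JDBC driver directly, outside spark.read
conn = jaydebeapi.connect(
    "com.teradata.jdbc.TeraDriver",
    "jdbc:teradata://<IP_ADDRESS>/DATABASE=<DB_NAME>,TMODE=ANSI,CHARSET=UTF8,LOGMECH=LDAP",
    [user, password],
    "/path/to/terajdbc4.jar",  # placeholder: path to the Teradata JDBC driver jar
)
try:
    curs = conn.cursor()
    curs.execute("DROP TABLE <DB_NAME>.<TABLENAME>")
    curs.close()
finally:
    conn.close()

The pyspark attempt that fails: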

drop_sql = """ (DROP TABLE <DB_NAME>.<TABLENAME>) """


conn = spark.read \
.format("jdbc") \
.option("driver","com.teradata.jdbc.TeraDriver") \
.option("url","jdbc:teradata://<IP_ADDRESS>/DATABASE=. <DB_NAME>,TMODE=ANSI,CHARSET=UTF8,TYPE=FASTLOAD,LOGMECH=LDAP") \
.option("query", drop_sql) \
.option("user", user) \
.option("password",password)\
.option("fetchsize",10000).load()

ERROR:

Py4JJavaError: An error occurred while calling o265.load. : java.sql.SQLException: [Teradata Database] [TeraJDBC 17.20.00.15] [Error 3707] [SQLState 42000] Syntax error, expected something like a name or a Unicode delimited identifier or an 'UDFCALLNAME' keyword or a 'SELECT' keyword or '(' between '(' and the 'DROP' keyword.


Solution

  • spark.read provides a higher-level interface. It is not a Teradata driver for Python.

    • Queries you pass to spark.read.format('jdbc').option("query", "...") can only contain SELECT statements.
    • Whatever you provide there is in turn wrapped in an outer SELECT by Spark before it is sent to the underlying driver for execution. E.g.
    
    spark.read.format("jdbc") \
        .option("url", jdbcUrl) \
        .option("query", "SELECT c1, c2 FROM t1") \
        .option("partitionColumn", "partiion_id") \
        .option("lowerBound", "1") \
        .option("upperBound", "300") \
        .option("numPartitions", "3") \
        .load()
    

    This would translate into three queries like the ones below being executed in parallel on the underlying DB. Note that the real queries would look slightly different (for one thing, Spark requires dbtable rather than query when partitionColumn is used); this is simplified for illustration:

    SELECT t.* FROM (SELECT c1, c2 FROM t1 WHERE partition_id BETWEEN 1 AND 100) t
    SELECT t.* FROM (SELECT c1, c2 FROM t1 WHERE partition_id BETWEEN 100 AND 200) t
    SELECT t.* FROM (SELECT c1, c2 FROM t1 WHERE partition_id BETWEEN 200 AND 300) t
    

    So in your case Teradata is unhappy because Spark is executing something along the lines of:

    SELECT t.* FROM (DROP TABLE <DB_NAME>.<TABLENAME>) t


    What you have is not a "pyspark jdbc connector to teradata"; it's the "Teradata JDBC driver".

    To run Teradata-specific SQL on Teradata you need to write Python code that uses a Teradata-specific driver, such as the teradatasql package. Here is an example.

    import teradatasql

    # Connect with the native Teradata Python driver and run the DDL directly
    with teradatasql.connect(host="whomooz", user="guest", password="please") as con:
        with con.cursor() as cur:
            try:
                sRequest = "DROP TABLE <DB_NAME>.<TABLENAME>"
                print(sRequest)
                cur.execute(sRequest)
            except Exception as ex:
                # Ignore errors such as the table not existing
                print("Ignoring", str(ex).split("\n")[0])

    If you want to run this code on a Databricks/Spark cluster, you'll have to make the teradatasql package available on the cluster in question, e.g. as a cluster library, and then run the code above on that cluster.

    The Teradata JDBC driver itself is only needed for the spark.read path; judging by the error message you're getting, you already have that installed.
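    If teradatasql isn't on the cluster yet, a minimal sketch of a notebook-scoped install on Databricks (assuming the cluster can reach PyPI; a cluster library achieves the same thing) is:

    %pip install teradatasql

    After that, the teradatasql example above should run on the driver node of that cluster.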