Search code examples
javapysparkjdbcnetsuite

PySpark JDBC connection to NetSuite2.com fails with 'Failed to login using TBA' error in Databricks


I'm encountering an issue while trying to connect to NetSuite2.com using PySpark from a Databricks Notebook with a JDBC driver. Despite successfully setting up a DBVisualizer connection using the same credentials, I'm receiving an authentication error when attempting to connect via PySpark.

Environment Details

  • Databricks Runtime Version: 10.4 LTS (includes Apache Spark 3.2.1, Scala 2.12)
  • NetSuite JDBC Driver Version: 2021.2
  • PySpark Version: 3.2.1

What I've Done So Far

  1. Successfully connected to NetSuite using DBVisualizer by installing the JDBC Driver (JAR) and generating the password with the one-time hashing NONCE.
  2. Attempted to use the same URL, JAR, and Password (utilizing NONCE) in PySpark within Databricks.

The Error

When running the PySpark code, I encounter the following error:

SparkException: Job aborted due to stage failure: Task 0 in stage 36.0 failed 4 times, most recent failure: Lost task 0.3 in stage 36.0 (TID ***) (*** executor 1): java.sql.SQLException: [NetSuite][SuiteAnalytics Connect JDBC Driver][OpenAccess SDK SQL Engine]Failed to login using TBA. Error ticket# m3z*********cp2f[232]

My Code

Here's the PySpark code I'm using to establish the connection:

pwd = generate_tba_password(account_id, consumer_key, token_id, consumer_secret, token_secret)

# Construct the JDBC URL
jdbc_url = f"jdbc:ns://{host}:{port};ServerDataSource=NetSuite2.com;Encrypted=1;NegotiateSSLClose=false;" \
           f"CustomProperties=(AccountID={account_id};RoleID={role_id};AuthType=TOKEN;" \
           f"TokenID={token_id};TokenSecret={token_secret};ConsumerKey={consumer_key};ConsumerSecret={consumer_secret})"

# Read data from NetSuite
df = spark.read.format("jdbc") \
    .option("url", jdbc_url) \
    .option("User", "TBA") \
    .option("Password", pwd) \
    .option("driver", "com.netsuite.jdbc.openaccess.OpenAccessDriver") \
    .option("dbtable", "customer") \
    .load()

df.show()

What I've Tried

  1. Verified that the generate_tba_password function is producing the correct password format.
  2. Tried different combinations in the CustomProperties string, including:
    • Removing the RoleID
    • Changing AuthType to "NLAuth" instead of "TOKEN"
  3. Ensured the JDBC driver JAR is properly installed in the Databricks environment.

Specific Questions

  1. Is the JDBC URL construction correct for NetSuite when using PySpark?
  2. Are there any known issues with Token-Based Authentication (TBA) in PySpark for NetSuite connections?
  3. Are there any Databricks-specific configurations needed for NetSuite JDBC connections that differ from standard JDBC connections?
  4. Could there be an issue with how Databricks is handling the NONCE-based password generation?

Any insights or suggestions on how to resolve this authentication error would be greatly appreciated. Thank you in advance for your help!


Solution

  • I found the answer to my question, and will record it here for reference. You cannot use spark when authenticating via Token-Based-Authentication (TBA). The reason is that the cluster runs two connections. First, to extract the metadata. Second, to extract the data for the query. However, spark reuses the same connection string to make both connections, which is a violation of NONCE (use only once) password and there is no way to have spark instead generate that password for every connection under the hood.

    For instance, the query above first extracts the metadata related to "customer" table. Columns, data types, etc. Then a second connection is attempted to take this metadata and extract all the data from the table.

    The way I solved for this was to change my authentication from TBA to username/password with rotation every 'n' days. It is secure method as long as it is managed properly, such as storing it as a secret in Databricks so even the user cannot see the passwords.

    One more thing to beware. Because Netsuite is a legacy system with a legacy api, you cannot have concurrent connections. This means that all reads MUST be single threaded. In spark, the way to do this is to set the option "numPartitions" to "1". And if you need to process the dataframe or display it, you should cache() it immediately after reading and prior to processing. Else, your code will fail due to concurrency trying to read the data from source in parallel. Cache writes the data to memory, after which all processing in spark will be on the cached dataframe not the actual source itself. Alternatively, but not recommended, is to write the data to some storage location of choice. Be sure to unpersist() the dataframe after your processing is complete.

    The following revised code works:

    DataSource = "NetSuite2.com"
    AccountId = <your account id>
    ServerHost = f"{AccountId}.connect.api.netsuite.com"
    RoleId = <your role id>
    Port = <your port>
    user = dbutils.secrets.get(scope = <your defined user scope>, key = <your defined user key>)
    password = dbutils.secrets.get(scope = <your defined pwd scope>, key = <your defined pwd key>)
    url = f"jdbc:ns://{ServerHost}:{Port};ServerDataSource={DataSource};encrypted=1;NegotiateSSLClose=false;CustomProperties=(AccountID={AccountId};RoleID={RoleId})"
    driver = "com.netsuite.jdbc.openaccess.OpenAccessDriver"
    
    table_df = spark.read.format ( "jdbc" ) \
                        .option ( "driver" , driver) \
                        .option ( "url" , url) \
                        .option ( "user" , user) \
                        .option ( "password", password) \
                        .option ( "dbtable" , "customer") \
                        .option ( "numPartitions" , "1") \
                        .load ()
    
            table_df.cache()
            table_df.show()
            # ... other processing steps ... #
            table_df.unpersist()