I'm encountering an issue while trying to connect to NetSuite2.com using PySpark from a Databricks notebook with a JDBC driver. Despite successfully setting up a DbVisualizer connection with the same credentials, I receive an authentication error when connecting via PySpark.
When running the PySpark code, I encounter the following error:
SparkException: Job aborted due to stage failure: Task 0 in stage 36.0 failed 4 times, most recent failure: Lost task 0.3 in stage 36.0 (TID ***) (*** executor 1): java.sql.SQLException: [NetSuite][SuiteAnalytics Connect JDBC Driver][OpenAccess SDK SQL Engine]Failed to login using TBA. Error ticket# m3z*********cp2f[232]
Here's the PySpark code I'm using to establish the connection:
# Generate a single-use TBA password from the token credentials
pwd = generate_tba_password(account_id, consumer_key, token_id, consumer_secret, token_secret)

# Construct the JDBC URL with the TBA custom properties
jdbc_url = (
    f"jdbc:ns://{host}:{port};ServerDataSource=NetSuite2.com;Encrypted=1;NegotiateSSLClose=false;"
    f"CustomProperties=(AccountID={account_id};RoleID={role_id};AuthType=TOKEN;"
    f"TokenID={token_id};TokenSecret={token_secret};ConsumerKey={consumer_key};ConsumerSecret={consumer_secret})"
)

# Read data from NetSuite
df = spark.read.format("jdbc") \
    .option("url", jdbc_url) \
    .option("user", "TBA") \
    .option("password", pwd) \
    .option("driver", "com.netsuite.jdbc.openaccess.OpenAccessDriver") \
    .option("dbtable", "customer") \
    .load()

df.show()
I have verified that the `generate_tba_password` function is producing the correct password format. Any insights or suggestions on how to resolve this authentication error would be greatly appreciated. Thank you in advance for your help!
I found the answer to my question and will record it here for reference. You cannot use Spark when authenticating via Token-Based Authentication (TBA). The reason is that the cluster makes two connections: the first to extract the metadata, the second to extract the data for the query. However, Spark reuses the same connection string for both connections, which violates the nonce ("number used once") embedded in the TBA password, and there is no way to have Spark regenerate that password for every connection under the hood.
For instance, the query above first extracts the metadata for the "customer" table (columns, data types, etc.). A second connection is then attempted to take this metadata and extract all the data from the table.
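For context, here is a minimal sketch of what a `generate_tba_password` helper can look like. The field order and signing scheme follow my reading of the SuiteAnalytics Connect documentation, so treat the details as assumptions and verify against the current docs; the point to notice is the fresh nonce and timestamp baked into every password:

import base64
import hashlib
import hmac
import secrets
import time

def generate_tba_password(account_id, consumer_key, token_id,
                          consumer_secret, token_secret):
    # A fresh nonce and timestamp go into every password, which is
    # exactly why a password cannot be replayed on a second connection.
    nonce = secrets.token_hex(16)
    timestamp = str(int(time.time()))
    base_string = "&".join([account_id, consumer_key, token_id, nonce, timestamp])
    signing_key = f"{consumer_secret}&{token_secret}".encode()
    digest = hmac.new(signing_key, base_string.encode(), hashlib.sha256).digest()
    signature = base64.b64encode(digest).decode()
    return f"{base_string}&{signature}&HMAC-SHA256"

Because the nonce and timestamp change on every call, a password generated once and then reused for Spark's second (data-reading) connection is rejected, which is the failure seen above.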
The way I solved this was to change my authentication from TBA to username/password with rotation every n days. It is a secure method as long as it is managed properly, for example by storing the credentials as Databricks secrets so that even the user cannot see the passwords.
One more thing to beware of: because NetSuite is a legacy system with a legacy API, you cannot have concurrent connections. This means that all reads MUST be single-threaded. In Spark, the way to enforce this is to set the option "numPartitions" to "1". If you need to process or display the DataFrame, cache() it immediately after reading and before any processing; otherwise your code will fail when Spark tries to read from the source in parallel. cache() materializes the data in memory, after which all Spark processing runs against the cached DataFrame rather than the source itself. Alternatively (but not recommended), you can write the data to a storage location of your choice; see the sketch after the code below. Be sure to unpersist() the DataFrame after your processing is complete.
The following revised code works:
DataSource = "NetSuite2.com"
AccountId = <your account id>
ServerHost = f"{AccountId}.connect.api.netsuite.com"
RoleId = <your role id>
Port = <your port>
user = dbutils.secrets.get(scope = <your defined user scope>, key = <your defined user key>)
password = dbutils.secrets.get(scope = <your defined pwd scope>, key = <your defined pwd key>)
url = f"jdbc:ns://{ServerHost}:{Port};ServerDataSource={DataSource};encrypted=1;NegotiateSSLClose=false;CustomProperties=(AccountID={AccountId};RoleID={RoleId})"
driver = "com.netsuite.jdbc.openaccess.OpenAccessDriver"
table_df = spark.read.format ( "jdbc" ) \
.option ( "driver" , driver) \
.option ( "url" , url) \
.option ( "user" , user) \
.option ( "password", password) \
.option ( "dbtable" , "customer") \
.option ( "numPartitions" , "1") \
.load ()
table_df.cache()
table_df.show()
# ... other processing steps ... #
table_df.unpersist()
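If you go with the write-to-storage alternative mentioned above instead of cache(), a sketch along these lines works. The Delta format and the /mnt/raw/netsuite/customer path are assumptions, so substitute whatever storage your workspace uses:

# Land the single-threaded read in storage once, then process from the copy
# rather than re-reading NetSuite. The path below is hypothetical.
table_df.write.format("delta").mode("overwrite").save("/mnt/raw/netsuite/customer")
customer_df = spark.read.format("delta").load("/mnt/raw/netsuite/customer")
customer_df.show()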