Tags: python, pyspark, jdbc, databricks, azure-ad-msal

Token (Access) Errors Connecting to MS SQL Server from Databricks Python Notebooks via the PySpark JDBC Driver Using an Azure Service Principal and MSAL


How do I resolve token (Active Directory access) errors when connecting to MS SQL Server from Databricks Python notebooks via the PySpark JDBC driver, using an Azure/Databricks Service Principal and MSAL (Microsoft Authentication Library)? I have tried both the PySpark JDBC driver package (current Maven coordinates: com.databricks:databricks-jdbc:2.6.25) and the Microsoft SQL Connector (current Maven coordinates: com.microsoft.azure:spark-mssql-connector_2.12:1.1.0). The connection works when I use ADAL, but with MSAL I get token errors regardless of which of these libraries I use. What is the correct way to make this connection?


Solution

  • The solution is only spottily described in the Azure and Databricks documentation (as well as on SO), because BOTH the PySpark JDBC driver and the MS SQL Connector libraries are required to connect Azure Databricks notebooks to MS SQL Server using an Active Directory Service Principal via MSAL. The following code works (it uses a key vault and secrets because the Azure Access Connector for Databricks and the Databricks Managed Identity do not interact directly with MS SQL Server):

    # library, msal, import
    import msal
    
    ##############################
    # initialize, confidential client (token cache)
    ##############################
    appToToken = msal.ConfidentialClientApplication(
        dbutils.secrets.get(strKeyVaultName, strKeyNameServicePrincipalId),                         # client (application) ID
        client_credential = dbutils.secrets.get(strKeyVaultName, strKeyNameServicePrincipalValue),  # client secret
        authority = "https://login.microsoftonline.com/" + dbutils.secrets.get(strKeyVaultName, strKeyNameTenantId))  # tenant authority
    
    ##############################
    # token, get
    ##############################
    resGet = appToToken.acquire_token_for_client(
        scopes = ["https://database.windows.net/.default"])
    
    # validate
    if "access_token" in resGet:
    
        ##############################
        # table, sql, load
        ##############################
        dfOut = spark.read \
            .format("com.microsoft.sqlserver.jdbc.spark") \
            .option("url", strURLDatabasePathName) \
            .option("databaseName", strDatabaseNameSQL) \
            .option("dbtable", strDatabaseSchemaTableNameAP) \
            .option("accessToken", resGet["access_token"]) \
            .option("encrypt", "true") \
            .option("hostNameInCertificate", "*.database.windows.net") \
            .load()
        
        ##############################
        # dataframe, write
        ##############################
        dfOut \
            .write \
            .mode("overwrite") \
            .saveAsTable(strTableNameFLEx)
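
    # otherwise, report the MSAL failure ("error" and
    # "error_description" are standard keys in an MSAL error response)
    else:

        ##############################
        # error, report
        ##############################
        print(resGet.get("error"))
        print(resGet.get("error_description"))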
    

    NOTE: this solution uses a Databricks key-vault-backed secret scope that points to an Azure Key Vault where I store the Service Principal ID and secret along with our Azure tenant ID. The MS SQL Server URL (strURLDatabasePathName) looks like this: "jdbc:sqlserver://[YOUR SERVER NAME].database.windows.net"
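
    For reference, a minimal sketch of how the str* variables above might be defined; every name and value below is a hypothetical placeholder, not something prescribed by the libraries:

    ##############################
    # secrets, scope and key names (placeholders)
    ##############################
    strKeyVaultName = "my-keyvault-scope"                  # Databricks secret scope backed by the Azure key-vault
    strKeyNameServicePrincipalId = "sp-client-id"          # key holding the Service Principal (application) ID
    strKeyNameServicePrincipalValue = "sp-client-secret"   # key holding the Service Principal secret
    strKeyNameTenantId = "tenant-id"                       # key holding the Azure tenant ID

    ##############################
    # connection, targets (placeholders)
    ##############################
    strURLDatabasePathName = "jdbc:sqlserver://myserver.database.windows.net"   # logical SQL Server
    strDatabaseNameSQL = "MyDatabase"                                           # database on that server
    strDatabaseSchemaTableNameAP = "dbo.MyTable"                                # source schema.table
    strTableNameFLEx = "my_output_table"                                        # destination Databricks table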

    Additionally, I installed all three libraries (the JDBC driver, the MS SQL Connector, and MSAL) on the Databricks job cluster (and/or compute policy) that runs the notebook, since the %pip magic install is not reliable for this purpose; a sketch of scripting that install follows.
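
    If you want to script the install, here is a sketch using the Databricks Libraries REST API; the workspace URL, personal access token, and cluster ID are placeholders (assumptions), while the Maven coordinates are the ones from the question:

    # library, requests, import
    import requests

    ##############################
    # placeholders, workspace
    ##############################
    strWorkspaceURL = "https://adb-1234567890123456.7.azuredatabricks.net"   # your workspace URL
    strPATToken = "dapi..."                                                  # your personal access token
    strClusterId = "0123-456789-abcdefgh"                                    # target (job) cluster ID

    ##############################
    # libraries, install
    ##############################
    resInstall = requests.post(
        strWorkspaceURL + "/api/2.0/libraries/install",
        headers = {"Authorization": "Bearer " + strPATToken},
        json = {
            "cluster_id": strClusterId,
            "libraries": [
                {"maven": {"coordinates": "com.databricks:databricks-jdbc:2.6.25"}},
                {"maven": {"coordinates": "com.microsoft.azure:spark-mssql-connector_2.12:1.1.0"}},
                {"pypi": {"package": "msal"}}]})
    resInstall.raise_for_status()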