Tags: azure, apache-spark, azure-synapse, linked-server, delta-lake

Linked Server to Synapse Spark Tables: Queries hang if a join on a STRING column is present


Setup:

  • Azure VM running SQL Server 2022
  • LS_CSV: Linked Server to a Synapse SQL Endpoint database, which itself has Views over CSV files in Spark
  • LS_Delta: Linked Server to a Synapse SQL Endpoint database, which itself has External Tables over Spark Delta tables
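For context, here's a sketch of how the linked-server side of this setup might look (the server name, endpoint, and database names below are hypothetical placeholders, not the actual environment):

```sql
-- Hypothetical sketch: linked server from the SQL Server 2022 VM
-- to the Synapse serverless SQL endpoint hosting the External Tables.
EXEC master.dbo.sp_addlinkedserver
    @server     = N'LS_Delta',
    @srvproduct = N'',
    @provider   = N'MSOLEDBSQL',                                 -- OLE DB Driver for SQL Server
    @datasrc    = N'myworkspace-ondemand.sql.azuresynapse.net',  -- placeholder endpoint
    @catalog    = N'delta_db';                                   -- placeholder database
```

LS_CSV would be defined the same way, just pointing at the database containing the CSV views.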

Situation:

  • Azure VM --> LS_CSV: queries all work well.

  • Azure VM --> LS_Delta: queries work well unless they involve a join on a STRING column; those reach a certain point and just hang (presumably forever: a 5-second query was still churning 2 hours later). It only takes one such join to 'ruin' the query. These same queries run fine when run from within that SQL endpoint database - just not over the linked server.
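To illustrate the failing pattern, a minimal hypothetical query (table and column names invented for the example; only the join on a STRING-typed column matters):

```sql
-- Runs fine inside the Synapse SQL endpoint database,
-- but hangs when issued over the linked server from the VM.
SELECT f.SaleId, c.CustomerName
FROM LS_Delta.delta_db.dbo.FactSales  AS f
JOIN LS_Delta.delta_db.dbo.DimCustomer AS c
    ON f.CustomerCode = c.CustomerCode;  -- STRING (varchar) join column triggers the hang
```

Equivalent joins on numeric columns over the same linked server do not exhibit the problem.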

More Info:

  • Looking at Live Query Statistics: it just hangs on the Remote Query operators that involve the STRING joins. Not getting much else out of Live Query Statistics.

  • When these problem queries do return partial results, they consistently hang at a specific row count: one query hangs at 22 rows, another at 2,300, etc.

What I've tried:

  • On-prem SQL Server --> LS_Delta: same issue
  • CAST the joined columns to varchar(n): no change
  • The External Tables pointing at the Spark Delta tables define the STRING columns as varchar(4000). Tried a test query after recreating those tables with more appropriate sizes, like varchar(20) or varchar(50): no change
  • The VM's databases use collation SQL_Latin1_General_CP1_CI_AS, while the Spark databases use Latin1_General_100_CI_AS_SC_UTF8. Tried creating a database on the Azure VM with the matching Latin1_General_100_CI_AS_SC_UTF8 collation to run the query from: same issue
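For reference, the "recreate with more appropriate sizes" attempt would look something like this serverless SQL pool External Table definition (data source, file format, location, and column names are hypothetical placeholders):

```sql
-- Hypothetical sketch of one of the recreated External Tables over Delta.
-- Shrinking varchar(4000) down to varchar(50) did NOT change the behavior.
CREATE EXTERNAL FILE FORMAT DeltaFormat
WITH (FORMAT_TYPE = DELTA);

CREATE EXTERNAL TABLE dbo.DimCustomer
(
    CustomerCode varchar(50) COLLATE Latin1_General_100_CI_AS_SC_UTF8,  -- was varchar(4000)
    CustomerName varchar(50) COLLATE Latin1_General_100_CI_AS_SC_UTF8
)
WITH (
    LOCATION    = 'customer/',      -- placeholder Delta folder
    DATA_SOURCE = delta_source,     -- placeholder external data source
    FILE_FORMAT = DeltaFormat
);
```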

Solution

  • Mostly got it (via MS Support)

    The External Tables pointed at the Delta tables were using varchar() for the string columns. Switching those columns to nvarchar() seems to fix it.
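A sketch of the change that resolved the hangs, using the same hypothetical table as above (names are placeholders):

```sql
-- The fix, per MS Support: define the string columns as nvarchar
-- instead of varchar in the External Table over the Delta source.
CREATE EXTERNAL TABLE dbo.DimCustomer
(
    CustomerCode nvarchar(50),  -- was varchar(50) with a UTF8 collation
    CustomerName nvarchar(50)
)
WITH (
    LOCATION    = 'customer/',
    DATA_SOURCE = delta_source,
    FILE_FORMAT = DeltaFormat
);
```

With nvarchar columns, the linked-server joins on those columns complete normally.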

    Why this is the case is not clear, though.

    It also goes against the documentation, [per this link on Best Practices, from MS][1]:

    > Use the varchar type with some UTF8 collation if you're reading data from Parquet, Azure Cosmos DB, Delta Lake, or CSV with UTF-8 encoding.

    But [shrug].
    [1]: https://learn.microsoft.com/en-us/azure/synapse-analytics/sql/best-practices-serverless-sql-pool