Search code examples
sqlsql-server-2008query-optimizationlinked-server

Two queries. Same Output. One takes 2 hours and the other 0 seconds. Why?


I have some IDs inserted into a temp table #A as follows:

SELECT DISTINCT ID
INTO #A
FROM LocalDB.dbo.LocalTable1
WHERE ID NOT IN (SELECT DISTINCT ID FROM LocalDB.dbo.LocalTable2)
GO

CREATE INDEX TT ON #A(ID)
GO

I am trying to obtain some information from a remote linked server using the identifiers I gathered in the previous stage:

Query 1:

SELECT ID, Desc
FROM RemoteLinkedServer.DB.dbo.RemoteTable X
WHERE ID IN (SELECT ID FROM #A)

Query 2:

SELECT ID, Desc
FROM RemoteLinkedServer.DB.dbo.RemoteTable X
INNER JOIN #A Y
ON X.ID = Y.ID

Now in the following query, what I am doing is obtain the output of the temp table, copy the rows and format them properly into a comma-separated list and manually putting it in the query.

Query 3:

SELECT ID, Desc
FROM RemoteLinkedServer.DB.dbo.RemoteTable X
WHERE ID IN (-- Put all identifiers here --)

Queries 1 and 2 take 2 hours to execute and query 3 takes 0 seconds (my temp table contains about 200 rows). I don't know what's going on and do not have permissions to check if the remote server has the relevant indexes on ID but it is simply baffling to see that a manually constructed query runs in no time indicating that there is something that is going wrong at the query optimization phase.

Any ideas on what's going wrong here or how I could speed up my query?


Solution

  • Queries 1 and 2 cause ALL of the data in the RemoteTable to be pulled into your local database in order to perform the join operation. This is going to eat RAM, network bandwidth and generally be very slow while the query is executing.

    Query 3 allows the remote server to filter down the results to send just those matches you want.

    Basically, it boils down to who does the work. Queries 1/2 require your local DB to do it; Query 3 lets the remote one do it.

    If you have a lot of data in that remote table, then you'll likely run into network congestion etc.

    The best approach to querying linked servers is to construct your queries such as the remote server does all the work and just sends results back to your local one. This will optimize the amount of network, memory and disk resources required to get the data you want.

    Any time you have to join across server boundaries (using a linked server) it's going to be a disaster.