Search code examples
sqlperformancejoinoptimizationsubquery

Improving Performance on Query Using Subquery in SELECT Statement


I've written a query for a report which is designed to pull some of information from a system transaction log (transaction_log), while also looking for a Master value from a second table (location_table) - this does not always exist, as sometimes the master_location or location_id fields can be NULL.

I have come up with the below query which produces my desired results, although it takes much longer to run the report since adding the 'master_location' column to the query.

I believe the performance of the query/report can probably be improved by ditching the subquery within the SELECT statement and using a join instead, but I am rusty with SQL at the moment and can't get it working.

Can anyone please suggest improvements to this query to improve the performance?

SELECT 
       transaction_type, 
       description,
       id_num,
       product_id,
       quantity,
       location_id,
       CASE WHEN transaction_log.location_id IS NOT NULL
            THEN ( SELECT TOP 1
                          master_location_id 
                     FROM location_table t1 
                    WHERE t1.location_id = transaction_log.source_location_id
                      AND master_location_id IS NOT NULL
                 )
            ELSE NULL
       END as master_location,
       dummy_value, 
       employee_id,
 FROM 
      transaction_log WITH (NOLOCK)

I tried a join as below, which is returning over 2000 rows too many (compared to the above query). Should I maybe be filtering down the results further in a WHERE clause?

SELECT 
       transaction_type, 
       description,
       id_num,
       product_id,
       quantity,
       location_id,
       master_location_id,
       dummy_value, 
       employee_id,
  FROM 
        transaction_log  t1 WITH (NOLOCK)
        INNER JOIN ( SELECT location_id,
                            master_location_id 
                       FROM location_table)  t2
                ON t1.location_id = t2.location_id

Solution

  • Situation

    • transaction_log.location_id can sometimes be NULL, so you want to make sure you get transaction log entries that have NULL locations. That means you need an outer join.
    • location_table.master_location_id will always be the same as each other or NULL, so you want to narrow down this list so there is only one row per location_id, master_location_id combination.
    • Further complicating this is the fact that master_location_id can sometimes be NULL, so, when narrowing down the list, you want to pull away the NULL values. All of them can be pulled out, as argued in the note below.

    One Solution

    SELECT 
           T_LOG.transaction_type, 
           T_LOG.description,
           T_LOG.id_num,
           T_LOG.product_id,
           T_LOG.quantity,
           T_LOG.location_id,
           U_LOC.master_location_id,
           T_LOG.dummy_value, 
           T_LOG.employee_id,
      FROM 
           transaction_log  T_LOG  WITH (NOLOCK)
           LEFT OUTER JOIN (
                               SELECT T_LOC.location_id,
                                      T_LOC.master_location_id 
                                 FROM location_table  T_LOC
                                WHERE T_LOC.master_location_id IS NOT NULL
                             GROUP BY T_LOC.location_id,
                                      T_LOC.master_location_id
                           )  U_LOC
                        ON T_LOG.location_id = U_LOC.location_id
    ;
    

    Note: Consider the following table values for location_table:

    
    +-------------+--------------------+
    | location_id | master_location_id |
    +-------------+--------------------+
    |          10 | 3                  |
    |          10 | 3                  |
    |          20 | NULL               |
    |          30 | 7                  |
    |          30 | NULL               |
    +-------------+--------------------+
    

    Removing all NULL master_location_id rows and grouping by both columns produces this:

    +-------------+--------------------+
    | location_id | master_location_id |
    +-------------+--------------------+
    |          10 |                  3 |
    |          30 |                  7 |
    +-------------+--------------------+
    

    There is an apparent loss of the row where location_id is 20. However, if the value were there, master_location_id would be NULL. Since it is not there, the LEFT OUTER JOIN will return a value of NULL for it anyway. So there is no issue.

    There would only be an issue if you wanted to interpret a missing row in location_table differently from the way you would a NULL value in location_table.master_location_id. That is outside the scope of this question.