Search code examples
t-sqlintersect

Using "UNION ALL" and "GROUP BY" to implement "Intersect"


I'v provided following query to find common records in 2 data sets but it's difficult for me to make sure about correctness of my query because of that I have a lot of data records in my DB.

Is it OK to implement Intersect between "Customers" & "Employees" tables using UNION ALL and apply GROUP BY on the result like below?

SELECT D.Country, D.Region, D.City
  FROM (SELECT DISTINCT Country, Region, City 
          FROM Customers
         UNION ALL
        SELECT DISTINCT Country, Region, City
          FROM Employees) AS D
GROUP BY D.Country, D.Region, D.City
HAVING COUNT(*) = 2;

So can we say that any record which exists in the result of this query also exists in the Intersect set between "Customers & Employees" tables AND any record that exists in Intersect set between "Customers & Employees" tables will be in the result of this query too?


Solution

  • So is it right to say any record in result of this query is in "Intersect" set between "Customers & Employees" "AND" any record that exist in "Intersect" set between "Customers & Employees" is in result of this query too?

    YES.

    ... Yes, but it won't be as efficient because you are filtering out duplicates three times instead of once. In your query you're

    1. Using DISTINCT to pull unique records from employees
    2. Using DISTINCT to pull unique records from customers
    3. Combining both queries using UNION ALL
    4. Using GROUP BY in your outer query to to filter the records you retrieved in steps 1,2 and 3.

    Using INTERSECT will return identical results but more efficiently. To see for yourself you can create the sample data below and run both queries:

    use tempdb
    go
    if object_id('dbo.customers') is not null drop table dbo.customers;
    if object_id('dbo.employees') is not null drop table dbo.employees;
    
    create table dbo.customers
    (
      customerId int identity,
      country    varchar(50),
      region     varchar(50),
      city       varchar(100)
    );
    
    create table dbo.employees
    (
      employeeId int identity,
      country    varchar(50),
      region     varchar(50),
      city       varchar(100)
    );
    
    insert dbo.customers(country, region, city) 
    values ('us', 'N/E', 'New York'), ('us', 'N/W', 'Seattle'),('us', 'Midwest', 'Chicago');
    insert dbo.employees
    values ('us', 'S/E', 'Miami'), ('us', 'N/W', 'Portland'),('us', 'Midwest', 'Chicago');
    

    Run these queries:

    SELECT D.Country, D.Region, D.City
    FROM 
    (
      SELECT DISTINCT Country, Region, City 
      FROM Customers
      UNION ALL
      SELECT DISTINCT Country, Region, City
      FROM Employees
    ) AS D
    GROUP BY D.Country, D.Region, D.City
    HAVING COUNT(*) = 2;
    
    SELECT Country, Region, City
    FROM dbo.customers
    INTERSECT
    SELECT Country, Region, City
    FROM dbo.employees;
    

    Results:

    Country     Region     City
    ----------- ---------- ----------
    us          Midwest    Chicago
    
    Country     Region     City
    ----------- ---------- ----------
    us          Midwest    Chicago
    

    If using INTERSECT is not an option OR you want a faster query you could improve the query you posted a couple different ways, such as:

    Option 1: let GROUP BY handle ALL the de-duplication like this:

    This is the same as what you posted but without the DISTINCTS

    SELECT D.Country, D.Region, D.City
    FROM 
    (
      SELECT Country, Region, City 
      FROM Customers
      UNION ALL
      SELECT Country, Region, City
      FROM Employees
    ) AS D
    GROUP BY D.Country, D.Region, D.City
    HAVING COUNT(*) = 2;
    

    Option 2: Use ROW_NUMBER

    This would be my preference and will likely be most efficient

    SELECT Country, Region, City
    FROM 
    (
      SELECT
        rn = row_number() over (partition by D.Country, D.Region, D.City order by (SELECT null)), 
        D.Country, D.Region, D.City
      FROM 
      (
        SELECT Country, Region, City 
        FROM Customers
        UNION ALL
        SELECT Country, Region, City
        FROM Employees
      ) AS D
    ) uniquify
    WHERE rn = 2;