Tags: sql-server, join, unique, window-functions, cartesian-product

How can I match a row in SQL Server only once?


I am asking for your help with a problem I ran into when joining two tables in SQL Server 2016 (v13).

I have 2 tables, Revenues and Cashins.

Revenues:

RevenueID ProductID InvoiceNo Amount
123 456 987 1000
234 456 987 1000

Cashins:

CashinID ProductID InvoiceNo Amount
ABC 456 987 1000
CDE 456 987 1000

The goal is to match cashins automatically to revenues (but only once!).

Both tables have their unique-ids but the columns used to join these tables are

  • ProductID
  • InvoiceNo
  • Amount

For entries with only one row in each table with those criteria, everything works fine.

Sometimes, though, several rows have the same values in these columns (as above) but different unique IDs. This is not an error; it is how the data is supposed to look.

The problem is that joining on these columns then produces a cartesian product of the matching rows.
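
To make the multiplication visible, here is a minimal diagnostic sketch. It assumes only the table and column names from this question; the aliases r_count, c_count and joined_rows are made up for illustration. It counts how many rows share each join key in both tables and shows how many rows a plain join would produce for that key.

```sql
-- Count the rows per join key in each table, then multiply the counts.
-- Any key listed here is matched more than once by a plain join.
SELECT
    R.ProductID,
    R.InvoiceNo,
    R.Amount,
    R.r_count,                          -- rows in Revenues with this key
    C.c_count,                          -- rows in CashIns with this key
    R.r_count * C.c_count AS joined_rows -- rows a plain join returns for this key
FROM
    (SELECT ProductID, InvoiceNo, Amount, COUNT(*) AS r_count
     FROM [Revenues]
     GROUP BY ProductID, InvoiceNo, Amount) R
INNER JOIN
    (SELECT ProductID, InvoiceNo, Amount, COUNT(*) AS c_count
     FROM [CashIns]
     GROUP BY ProductID, InvoiceNo, Amount) C
        ON R.ProductID = C.ProductID
       AND R.InvoiceNo = C.InvoiceNo
       AND R.Amount = C.Amount
WHERE
    R.r_count > 1 OR C.c_count > 1
```

For the sample data above, this lists the key 456 / 987 / 1000 with r_count = 2 and c_count = 2, so the join returns 2 * 2 = 4 rows for it.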

To recreate the tables, here are the statements:

DROP TABLE IF EXISTS Revenues
GO

CREATE TABLE Revenues 
(
    RevenueID [nvarchar](10) NULL,  
    ProductID [nvarchar](10) NULL,  
    InvoiceNo [nvarchar](10) NULL,      
    Amount money NULL
)
GO

DROP TABLE IF EXISTS CashIns
GO

CREATE TABLE CashIns 
(
    CashinID [nvarchar](10) NULL,
    ProductID [nvarchar](10) NULL,  
    InvoiceNo [nvarchar](10) NULL,      
    Amount money NULL
)
GO

INSERT INTO [Revenues] VALUES ('123', '456', '987', 1000)
INSERT INTO [Revenues] VALUES ('234', '456', '987', 1000)

INSERT INTO [CashIns] VALUES ('ABC', '456', '987', 1000)
INSERT INTO [CashIns] VALUES ('CDE', '456', '987', 1000)

Desired output:

RevenueID ProductID InvoiceNo Amount CashinID
123 456 987 1000 ABC
234 456 987 1000 CDE

My current query:

SELECT
    R.RevenueID,
    R.ProductID,
    R.InvoiceNo,
    R.Amount,
    C.CashinID
FROM 
    [Revenues] R
LEFT JOIN 
    [CashIns] C ON R.ProductID = C.ProductID
                AND R.InvoiceNo = C.InvoiceNo
                AND R.Amount = C.Amount

Results:

RevenueID ProductID InvoiceNo Amount CashinID
123 456 987 1000 ABC
123 456 987 1000 CDE
234 456 987 1000 ABC
234 456 987 1000 CDE

Which in theory makes sense, but I just can't seem to find a solution where each row is just used once.

Two things I found and tried are window functions and OUTER APPLY with a TOP(1) selection. Both gave the same result:

SELECT
    *
FROM 
    [Revenues] R
OUTER APPLY 
    (SELECT TOP(1) *
     FROM [CashIns] C) C

This returns the desired columns from the Revenues table, but every row is matched to the same first row from the Cashins table:

RevenueID ProductID InvoiceNo Amount CashinID
123 456 987 1000 ABC
234 456 987 1000 ABC

I also thought about something like updating the Revenues table, so that the matched CashinID is next to a line and then check every time that the CashinID is not yet used within that table, but I couldn't make it work...

Many thanks in advance for any help or hint in the right direction!


Solution

  • As I said in my comment, you have a fundamental problem with your data relationships: one of the tables needs to reference the unique identifier of the other. Without that reference, all you can do is order the transactions in both tables and join them by row number. You're relying on hope and prayer to join your data instead of irrefutable identifiers.

    --This example orders the transactions in each transaction table and uses
    --the order number to join them.
    WITH RevPrelim AS (
        SELECT *
            , ROW_NUMBER() OVER(PARTITION BY InvoiceNo, ProductID, Amount ORDER BY RevenueID) AS row_num
        FROM [Revenues] R
    ), CashinsPrelim AS (
        SELECT *
            , ROW_NUMBER() OVER(PARTITION BY InvoiceNo, ProductID, Amount ORDER BY CashinID) AS row_num
        FROM [CashIns] AS C
    )
    SELECT *
    FROM RevPrelim AS r
        LEFT OUTER JOIN CashinsPrelim AS c
            ON c.ProductID = r.ProductID
                AND c.InvoiceNo = r.InvoiceNo
                AND c.Amount = r.Amount
                AND c.row_num = r.row_num
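
If you want to follow the advice above and store the reference, one option is to persist the row-number match in the Revenues table. This is only a sketch under assumptions: the extra CashinID column on Revenues is hypothetical (it is not part of the tables in the question), and RevenueID is assumed to be unique, as the question states.

```sql
-- Assumption: a new, hypothetical column on Revenues stores the matched CashinID.
ALTER TABLE [Revenues] ADD CashinID nvarchar(10) NULL
GO

-- Number the rows per join key in both tables, pair them by row number,
-- and write the paired CashinID back to the matching revenue row.
WITH RevPrelim AS (
    SELECT RevenueID, ProductID, InvoiceNo, Amount
        , ROW_NUMBER() OVER(PARTITION BY InvoiceNo, ProductID, Amount ORDER BY RevenueID) AS row_num
    FROM [Revenues]
), CashinsPrelim AS (
    SELECT CashinID, ProductID, InvoiceNo, Amount
        , ROW_NUMBER() OVER(PARTITION BY InvoiceNo, ProductID, Amount ORDER BY CashinID) AS row_num
    FROM [CashIns]
)
UPDATE rev
SET rev.CashinID = c.CashinID
FROM [Revenues] AS rev
    INNER JOIN RevPrelim AS r
        ON r.RevenueID = rev.RevenueID   -- relies on RevenueID being unique
    INNER JOIN CashinsPrelim AS c
        ON c.ProductID = r.ProductID
            AND c.InvoiceNo = r.InvoiceNo
            AND c.Amount = r.Amount
            AND c.row_num = r.row_num
```

With the sample data, revenue 123 should end up paired with the cash-in that sorts first by CashinID and revenue 234 with the second. Which revenue gets which cash-in within a group is still arbitrary; it only depends on the ID ordering. This sketch also covers a single run only: if new rows arrive later, a rerun would have to leave out revenues that already have a CashinID and cash-ins that are already referenced.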