Search code examples
sqljoin

Why do multiple-table joins produce duplicate rows?


Let's say I have three tables A, B, and C. Each has two columns: a primary key and some other piece of data. They each have the same number of rows. If I JOIN A and B on the primary key, I should end up with the same number of rows as are in either of them (as opposed to A.rows * B.rows).

Now, if I JOIN A JOIN B with C, why do I end up with duplicate rows? I have run into this problem on several occasions and I do not understand it. It seems like it should produce the same result as JOINing A and B since it has the same number of rows but, instead, duplicates are produced.

Queries that produce results like this are of the format

SELECT *
FROM M
    INNER JOIN S
        on M.mIndex = S.mIndex
    INNER JOIN D
        ON M.platformId LIKE '%' + D.version + '%'
    INNER JOIN H
        ON D.Name = H.Name
        AND D.revision = H.revision

Here are schemas for the tables. H contains is a historic table containing everything that was ever in D. There are many M rows for each D and one S for each M.

Table M

    [mIndex] [int] NOT NULL PRIMARY KEY,
    [platformId] [nvarchar](256) NULL,
    [ip] [nvarchar](64) NULL,
    [complete] [bit] NOT NULL,
    [date] [datetime] NOT NULL,
    [DeployId] [int] NOT NULL PRIMARY KEY REFERENCES D.DeployId,
    [source] [nvarchar](64) NOT NULL PRIMARY KEY

Table S

[order] [int] NOT NULL PRIMARY KEY,
[name] [nvarchar](64) NOT NULL,
[parameters] [nvarchar](256) NOT NULL,
[Finished] [bit] NOT NULL,
[mIndex] [int] NOT NULL PRIMARY KEY,
[mDeployId] [int] NOT NULL PRIMARY KEY,
[Date] [datetime] NULL,
[status] [nvarchar](10) NULL,
[output] [nvarchar](max) NULL,
[config] [nvarchar](64) NOT NULL PRIMARY KEY

Table D

[Id] [int] IDENTITY(1,1) NOT NULL PRIMARY KEY,
[branch] [nvarchar](64) NOT NULL,
[revision] [int] NOT NULL,
[version] [nvarchar](64) NOT NULL,
[path] [nvarchar](256) NOT NULL

Table H

[IdDeploy] [int] IDENTITY(1,1) NOT NULL,
[name] [nvarchar](64) NOT NULL,
[version] [nvarchar](64) NOT NULL,
[path] [nvarchar](max) NOT NULL,
[StartDate] [datetime] NOT NULL,
[EndDate] [datetime] NULL,
[Revision] [nvarchar](64) NULL,

I didn't post the tables and query initially because I am more interested in understanding this problem for myself and avoiding it in the future.


Solution

  • If one of the tables M, S, D, or H has more than one row for a given Id (if just the Id column is not the Primary Key), then the query would result in "duplicate" rows. If you have more than one row for an Id in a table, then the other columns, which would uniquely identify a row, also must be included in the JOIN condition(s).

    References:

    Related Question on MSDN Forum