Search code examples
group-byduplicatesamazon-redshiftdistinctcommon-table-expression

Redshift duplicated rows count mismatch using CTE due to table primary key configuration


It looks like I've come across a Redshift bug/inconsistency. I explain my original question first and include below a reproducible example.

Original question

I have a table with many columns in Redshift with some duplicated rows. I've tried to determine the number of unique rows using CTEs and two different methods: DISTINCT and GROUP BY.
The GROUP BY method looks something like this:

WITH duplicated_rows as 
(SELECT *, COUNT(*) AS q
FROM my_schema.my_table
GROUP BY  1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 
16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 
31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 
46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 
61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 
76, 77, 78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88, 89, 90, 
91, 92, 93, 94, 95, 96, 97, 98, 99, 100, 101, 102, 103, 104)
---
SELECT COUNT(*) count_unique_rows, SUM(q) count_total_rows
FROM duplicated_rows

With this query I get this result:

count_unique_rows | count_total_rows
------------------------------------
      27          |        83

Then I use the DISTINCT method

WITH unique_rows as 
(SELECT DISTINCT *
FROM my_schema.my_table)
---
SELECT COUNT(*) as count_unique_rows
FROM unique_rows

And I get this result:

count_unique_rows 
-----------------
      63

So the CTE with GROUP BY seems to indicate 27 unique rows and CTE with the DISTINCT shows 63 unique rows.
As the next troubleshooting step I executed the GROUP BY outside the CTE and it yields 63 rows!
I also exported the 83 original rows to excel and applied the remove duplicated function and 63 rows remained, so that seems to be the correct number.
What I can't understand, for the life of me, is where the number 27 comes from when I use the CTE combined with the GROUP BY.
Is there a limitation with CTEs and Redshift that I'm not aware of? Is it a bug in my code? Is it a bug in Redshift?
Any help in clarifying this mystery would be greatly appreciated!!

Reproducible Example

Create and populate the table

create table my_schema.students
(name VARCHAR(100),
day DATE,
course VARCHAR(100),
country VARCHAR(100),
address VARCHAR(100),
age INTEGER,
PRIMARY KEY (name))

INSERT INTO my_schema.students 
VALUES
('Alan', '2000-07-15', 'Physics', 'CA', '12th Street', NULL),
('Alan', '2021-01-15', 'Math', 'USA', '8th Avenue', 21),
('Jane', '2021-01-16', 'Chemistry', 'USA', NULL, 21),
('Jane', '2021-01-16', 'Chemistry', 'USA', NULL, 21),
('Patrick', '2021-07-16', 'Chemistry', NULL, NULL, 21),
('Patrick', '2021-07-16', 'Chemistry', NULL, NULL, 21),
('Kate', '2018-07-20', 'Literature', 'AR', '8th and 23th', 18),
('Kate', '2021-10-20', 'Philosophy', 'ES', NULL, 30);

Calculate unique rows with CTE and GROUP BY

WITH duplicated_rows as 
(SELECT *, COUNT(*) AS q
FROM my_schema.students
GROUP BY  1, 2, 3, 4, 5, 6)
---
SELECT COUNT(*) count_unique_rows, SUM(q) count_total_rows
FROM duplicated_rows

The result is INCORRECT!

count_unique_rows | count_total_rows
-------------------------------------
      4           |         8

Calculate unique rows with CTE and DISTINCT

WITH unique_rows as 
(SELECT DISTINCT *
FROM my_schema.students)
---
SELECT COUNT(*) as count_unique_rows 
FROM unique_rows

The result is CORRECT!

count_unique_rows
-----------------
      6

The core of the issue seems to be the primary key, which Redshift doesn't enforce, but uses it for some kind of lazy evaluation to determine row differences within a CTE, which leads to inconsistent results.


Solution

  • The strange behaviour is caused by this line:

    PRIMARY KEY (name)
    

    From Defining table constraints - Amazon Redshift:

    Uniqueness, primary key, and foreign key constraints are informational only; they are not enforced by Amazon Redshift. Nonetheless, primary keys and foreign keys are used as planning hints and they should be declared if your ETL process or some other process in your application enforces their integrity.

    For example, the query planner uses primary and foreign keys in certain statistical computations. It does this to infer uniqueness and referential relationships that affect subquery decorrelation techniques. By doing this, it can order large numbers of joins and eliminate redundant joins.

    The planner leverages these key relationships, but it assumes that all keys in Amazon Redshift tables are valid as loaded. If your application allows invalid foreign keys or primary keys, some queries could return incorrect results. For example, a SELECT DISTINCT query might return duplicate rows if the primary key is not unique. Do not define key constraints for your tables if you doubt their validity. On the other hand, you should always declare primary and foreign keys and uniqueness constraints when you know that they are valid.

    In your sample data, the PRIMARY KEY clearly cannot be name because there are multiple rows with the same name. This violates assumptions made by Redshift and can lead to incorrect results.

    If you remove the PRIMARY KEY (name) line, the data results are correct.

    (FYI, I discovered this by running your commands in sqlfiddle.com against a PostgreSQL database. It would not allow the data to be inserted because it violated the PRIMARY KEY condition.)