postgresql distinct plpgsql common-table-expression distinct-on

PL/pgSQL : Does DISTINCT ON passes through with it's result onto next CTE?

My code looks like this (schema), because it's pretty huge:

something AS ( SELECT DISTINCT ON (x1,x2,x3,x4) ... ),
something2 AS (xx.*, ... FROM something xx LEFT JOIN ...),
something3 AS (xx.*, ... FROM something2 xx LEFT JOIN ...)
SELECT ... FROM something3

So imagine this situation: After using DISTINCT ON (x1,x2,x3,x4) in something and then

select * from something

(ignoring something2 and something3 here) : I get 1700 results.

But the problem is this is not my expected result (yet), because I need to use some more CTE to leftjoin some information

So when I have the same DISTINCT ON in something and I do

select * from something3 (which is a final expected result that should return 1700 rows)

I suddenly get 4000 results with the values that I wanted to distinct earlier on in something.

It seems like I'm losing the DISTINCT I've typed in something, because when I put the same syntax:

DISTINCT ON (x1,x2,x3,x4) ... in all three something's then I get 1700 results - But it's not really what I'm looking for.

Time means much to me.

Could someone help me out with the solution and better understanding of the problem here?

Solution

This happens, because:

CTEs can be thought of as defining temporary tables that exist just for one query.

Which means, that doesn't matter, if you defined one of your CTE with a DISTINT ON clause, because other CTEs (and your main query) will see only a temporary table (or more like a result-set), but nothing more. If use use join on this temporary table, you could end up more results (just like with normal tables).

To ensure your main query does not contain duplicates, move the DISTINT ON clause there (or, in theory, all of your CTEs can have it - but at least the last one should have).

F.ex. these should produce your desired output:

WITH s AS (SELECT x FROM t),
s2 AS (SELECT x FROM s),
s3 AS (SELECT x FROM s2),
SELECT DISTINCT x FROM s3;

-- vs.

WITH s AS (SELECT x FROM t),
s2 AS (SELECT x FROM s),
s3 AS (SELECT DISTINCT x FROM s2),
SELECT x FROM s3;

-- vs.

WITH s AS (SELECT DISTINCT x FROM t),
s2 AS (SELECT DISTINCT x FROM s),
s3 AS (SELECT DISTINCT x FROM s2),
SELECT x FROM s3;

-- this last version should be only used, when you have
-- enormous joins, so you want to keep your "temporary"
-- tables' size smaller in the memory