Search code examples
sqlpostgresqlaggregatepostgresql-performance

Selecting single and only single rows from GROUP BY


I have two tables: parcel and structure, with a one-to-many relationship between them: structure.parcel_id points to parcel.id.

I want to select all the single structures. My current solution works, but is very grotesque:

SELECT 
max(column_1),
max(column_2),
max(column_3),
...
(twenty+ columns)

FROM structure
GROUP BY parcel_id
HAVING count(structure.id) = 1;

Because structure.id is non-nullable and the HAVING clause above, every group has, by definition, only one row in it. Unfortunately Postgres doesn't realize this, so if I say:

SELECT *    
FROM structure
GROUP BY parcel_id
HAVING count(structure.id) = 1;

Then I get the expected error about needing to use an aggregate function for the columns. I get around this with the arbitrary max() function, but this is confusing to someone else trying to understand the code, and it forces me to explicitly list all the columns, which means I have to dive back in and edit this code whenever a column is added. (Which, unfortunately, happens fairly frequently in my environment.)

I have this alternate solution, which solves most of my problems:

SELECT * FROM STRUCTURE
WHERE id IN (
    SELECT
        max(id) as id
    FROM structure
    GROUP by structure.parcel_id
    HAVING count(structure.id)  = 1
    );

But this is clearly adding unnecessary slowness to my query which I'd like to avoid because of the frequency of the query and the size of the table.

This question is very similar to what I'm asking, but it will grab the first row of every group, not the first (and only) row of singular groups.

Is there an elegant way to solve this problem?

Sample data per request:

structure table:

id | parcel_id | column_1 | column_2 | ...
------------------------------------------
1  |   536     |   ...    | ....     | ...
2  |   536     |   ...    | ....     | ...
3  |   537     |   ...    | ....     | ...
4  |   538     |   ...    | ....     | ...
5  |   538     |   ...    | ....     | ...
6  |   539     |   ...    | ....     | ...
7  |   540     |   ...    | ....     | ...
8  |   541     |   ...    | ....     | ...
9  |   541     |   ...    | ....     | ...

Desired result:

id | parcel_id | column_1 | column_2 | ...
------------------------------------------
3  |   537     |   ...    | ....     | ...
6  |   539     |   ...    | ....     | ...
7  |   540     |   ...    | ....     | ...

Note that 537, 539, and 540 are the only parcel_id's that don't repeat.

Both tables have ~1.5 million rows and ~25 columns.


Solution

  • How about using window functions?

    SELECT s.*    
    FROM (SELECT s.*, COUNT(*) OVER (PARTITION BY parcel_id) as cnt
          FROM structure s
         ) s
    WHERE cnt = 1;
    

    However, a more efficient method might be:

    select s.*
    from structure s
    where not exists (select 1
                      from structure s2
                      where s2.parcel_id = s.parcel_id and s2.id<> s.id
                     );
    

    In particular, this can take advantage of an index on structure(parcel_id, id).