Search code examples
postgresqlselectdistinct

Select first non-null value for multiple columns and different rows in PostgreSQL


I'm trying to create a view based off a table. I want to get a set of rows where there is an existing tax_id_no, with each row having the most recent information. So I'm ordering by timestamps descending. However, each tax_id_no can have multiple rows, and not every row will have all the information. So I want to get the first valid piece of information for each column. Right now I've got this:

SELECT * FROM
(
SELECT DISTINCT ON (store_id, tax_id_no)
    event_id,
    event_tstamp,
    owner_id,
    store_id,
    tax_id_no,
    first_value(year_built) OVER (ORDER BY year_built IS NULL, event_tstamp) AS year_built, --New
    first_value(roof_replaced_year) OVER (ORDER BY roof_replaced_year IS NULL, event_tstamp) AS roof_replaced_year, --New
    first_value(number_of_rooms) OVER (ORDER BY number_of_rooms IS NULL, event_tstamp) AS number_of_rooms, --New

FROM MySchema.Event
WHERE tax_id_no IS NOT NULL AND tax_id_no != ''
order by store_id, tax_id_no, event_tstamp DESC
) t1
WHERE owner_id IS NOT NULL OR owner_id != '';

This is getting the same first valid information for every row though. So instead of getting results like this, which is what I want:

event_id    event_tstamp    owner_id    store_id    tax_id_no   year_built  roof_replaced_year  number_of_rooms
04          2016-05-12      123         02          12345       1996        2009                6
05          2017-02-02      245         02          23456       1970        1999                8
08          2017-03-03      578         03          34567       2002        2016                10

I'm getting this, which all the rows looking the same in the first_value columns:

event_id    event_tstamp    owner_id    store_id    tax_id_no   year_built  roof_replaced_year  number_of_rooms
04          2016-05-12      123         02          12345       1996        2009                6
05          2017-02-02      245         02          23456       1996        2009                6
08          2017-03-03      578         03          34567       1996        2009                6

Is it possible to select a different first_value for each row? I was thinking I could do some kind of a join across multiple selects from the same table, but I'm not sure that would actually give me unique values for each row instead of just having the same problem again. There's also the length of time for such queries to consider, which so far have been prohibitively expensive.


Solution

  • You can use a partition in your window functions to group the rows before applying the function. That will generate a distinct result for each partition.

    For example:

    first_value(number_of_rooms) OVER (
        PARTITION BY tax_id_no 
        ORDER BY number_of_rooms IS NULL, event_tstamp
    ) AS number_of_rooms,