Tags: sql, postgresql, aggregate-functions, lateral-join, postgresql-13

SQL query: multiple challenges


Not being an SQL expert, I am struggling with the following:

I inherited a largish table (about 100 million rows) containing time-stamped events that represent stage transitions of mostly short-lived phenomena. The events are unfortunately recorded in a somewhat strange way, with the table looking as follows:

phen_ID   record_time  producer_id   consumer_id  state   ...

000123    10198789                               start
          10298776     000123        000112      hjhkk
000124    10477886                               start
          10577876     000124        000123      iuiii
000124    10876555                               end

Each phenomenon (phen-ID) has a start event and theoretically an end event, although the end event might not have occurred yet and thus not be recorded. Each phenomenon can then go through several states. Unfortunately, for some states, the ID is recorded in either the producer or the consumer field rather than in phen_ID. Also, the number of states is not fixed, and neither is the time between the states.

To begin with, I need to create an SQL statement that, for each phen-ID, shows the start time and the time of the last recorded event (which could be an end state or one of the intermediate states).

Just considering a single phen-ID, I managed to pull together the following SQL:

WITH myconstants (var1) as (
    values ('000123')
)
select min(l.record_time), max(l.record_time)
from (
    select distinct *
    from public.phen_table
    JOIN myconstants ON var1 IN (phen_id, producer_id, consumer_id)
) as l

As the start-state always has the lowest recorded-time for the specific phenomenon, the above statement correctly returns the recorded time range as one row irrespective of what the end state is.

Obviously here I have to supply the phen-ID manually.

How can I make this work so that I get one row with the start time and the maximum recorded time for each unique phen-ID? I played around with trying to fit in something like select distinct phen-id ... but was not able to "feed" the IDs automatically into the query above. Or am I completely off the mark here?

Addition: Just to clarify, the ideal output using the table above would look something like this:

ID         min-time      max-time
000123     10198789      10577876   (min-time is start, max-time is state iuiii)
000124     10477886      10876555   (min-time is start, max-time is end state)

Solution

  • union all might be an option:

    select phen_id, 
        min(record_time) as min_record_time, 
        max(record_time) as max_record_time
    from (
        select phen_id, record_time from phen_table
        union all select producer_id, record_time from phen_table
        union all select consumer_id, record_time from phen_table
    ) t
    where phen_id is not null
    group by phen_id
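
    As a quick check, the union all query can be run against the five sample rows from the question. This is only a sketch: the inline phen_table CTE is a stand-in for the real table, and it assumes that the blank cells are stored as NULL and that the IDs are text. If the blanks are actually empty strings, each column would need something like nullif(col, '') first.

    ```sql
    -- Stand-in for the real table: the five sample rows from the question.
    -- Assumption: blank cells are NULL, IDs are text.
    with phen_table (phen_id, record_time, producer_id, consumer_id, state) as (
        values
            ('000123', 10198789, null,     null,     'start'),
            (null,     10298776, '000123', '000112', 'hjhkk'),
            ('000124', 10477886, null,     null,     'start'),
            (null,     10577876, '000124', '000123', 'iuiii'),
            ('000124', 10876555, null,     null,     'end')
    )
    select phen_id,
        min(record_time) as min_record_time,
        max(record_time) as max_record_time
    from (
        -- Stack the three ID columns into a single column
        select phen_id, record_time from phen_table
        union all select producer_id, record_time from phen_table
        union all select consumer_id, record_time from phen_table
    ) t
    where phen_id is not null
    group by phen_id
    order by phen_id;
    -- 000112 | 10298776 | 10298776
    -- 000123 | 10198789 | 10577876
    -- 000124 | 10477886 | 10876555
    ```

    Note that 000112 is reported as well, because it appears as a consumer_id even though the sample contains no start row for it.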
    

    On the other hand, if you want prioritization, then you can use coalesce():

    select coalesce(phen_id, producer_id, consumer_id) as phen_id, 
        min(record_time) as min_record_time, 
        max(record_time) as max_record_time
    from phen_table
    group by coalesce(phen_id, producer_id, consumer_id)
    

    The logic of the two queries is not exactly the same. If there are rows where more than one of the three columns is not null, and the values differ, then the first query takes into account all non-null values, while the second considers only the "first" non-null value.
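
    The sample rows from the question make the difference concrete, since two of them carry different IDs in producer_id and consumer_id. The sketch below runs the coalesce() query on an inline stand-in for the table, again assuming blank cells are NULL and IDs are text.

    ```sql
    -- Stand-in for the real table: the five sample rows from the question.
    -- Assumption: blank cells are NULL, IDs are text.
    with phen_table (phen_id, record_time, producer_id, consumer_id, state) as (
        values
            ('000123', 10198789, null,     null,     'start'),
            (null,     10298776, '000123', '000112', 'hjhkk'),
            ('000124', 10477886, null,     null,     'start'),
            (null,     10577876, '000124', '000123', 'iuiii'),
            ('000124', 10876555, null,     null,     'end')
    )
    select coalesce(phen_id, producer_id, consumer_id) as phen_id,
        min(record_time) as min_record_time,
        max(record_time) as max_record_time
    from phen_table
    group by coalesce(phen_id, producer_id, consumer_id)
    order by 1;
    -- 000123 | 10198789 | 10298776
    -- 000124 | 10477886 | 10876555
    ```

    Here the row at 10577876 is attributed only to its producer 000124, so the maximum for 000123 stops at 10298776 instead of the 10577876 shown in the desired output. For this data, the query that looks at all three columns matches the question more closely.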


    Edit

    In Postgres, which you finally tagged, the union all solution can be phrased more efficiently with a lateral join:

    select x.phen_id, 
        min(p.record_time) as min_record_time, 
        max(p.record_time) as max_record_time
    from phen_table p
    cross join lateral (values (phen_id), (producer_id), (consumer_id)) as x(phen_id)
    where x.phen_id is not null
    group by x.phen_id
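
    The desired output in the question lists only IDs that have a start event, whereas the query above also reports IDs that occur solely as a producer or consumer (000112 in the sample). One possible refinement, which is a sketch rather than a tested solution for the real table, is a having clause with bool_or(). It assumes that start rows carry state = 'start' with empty producer and consumer fields, that blank cells are NULL, and it uses an inline stand-in for the table.

    ```sql
    -- Stand-in for the real table: the five sample rows from the question.
    -- Assumption: blank cells are NULL, IDs are text.
    with phen_table (phen_id, record_time, producer_id, consumer_id, state) as (
        values
            ('000123', 10198789, null,     null,     'start'),
            (null,     10298776, '000123', '000112', 'hjhkk'),
            ('000124', 10477886, null,     null,     'start'),
            (null,     10577876, '000124', '000123', 'iuiii'),
            ('000124', 10876555, null,     null,     'end')
    )
    select x.phen_id,
        min(p.record_time) as min_record_time,
        max(p.record_time) as max_record_time
    from phen_table p
    -- Unpivot the three ID columns of each row into one column
    cross join lateral (values (p.phen_id), (p.producer_id), (p.consumer_id)) as x(phen_id)
    where x.phen_id is not null
    group by x.phen_id
    -- Keep only IDs for which at least one matching row is a start event
    having bool_or(p.state = 'start')
    order by x.phen_id;
    -- 000123 | 10198789 | 10577876
    -- 000124 | 10477886 | 10876555
    ```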