Not being an SQL expert, I am struggling with the following:
I inherited a larg-ish table (about 100 million rows) containing time-stamped events that represent stage transitions of mostly shortlived phenomena. The events are unfortunately recorded in a somewhat strange way, with the table looking as follows:
phen_ID record_time producer_id consumer_id state ...
000123 10198789 start
10298776 000123 000112 hjhkk
000124 10477886 start
10577876 000124 000123 iuiii
000124 10876555 end
Each phenomenon (phen-ID) has a start event and theoretically an end event, although it might not have been occured yet and thus not recorded. Each phenomenon can then go through several states. Unfortunately, for some states, the ID is recorded in either a product or a consumer field. Also, the number of states is not fixed, and neither is the time between the states.
To beginn with, I need to create an SQL statement that for each phen-ID shows the start time and the time of the last recorded event (could be an end state or one of the intermediate states).
Just considering a single phen-ID, I managed to pull together the following SQL:
WITH myconstants (var1) as (
values ('000123')
)
select min(l.record_time), max(l.record_time) from
(select distinct * from public.phen_table JOIN myconstants ON var1 IN (phen_id, producer_id, consumer_id)
) as l
As the start-state always has the lowest recorded-time for the specific phenomenon, the above statement correctly returns the recorded time range as one row irrespective of what the end state is.
Obviously here I have to supply the phen-ID manually.
How can I make this work that so I get a row of the start times and maxium recorded time for each unique phen-ID? Played around with trying to fit in something like select distinct phen-id ...
but was not able to "feed" them automatically into the above. Or am I completely off the mark here?
Addition: Just to clarify, the ideal output using the table above would like something like this:
ID min-time max-time
000123 10198789 10577876 (min-time is start, max-time is state iuii)
000124 10477886 10876555 (min-time is start, max-time is end state)
union all
might be an option:
select phen_id,
min(record_time) as min_record_time,
max(record_time) as max_record_time
from (
select phen_id, record_time from phen_table
union all select producer_id, record_time from phen_table
union all select consumer_id, record_time from phen_table
) t
where phen_id is not null
group by phen_id
On the other hand, if you want prioritization, then you can use coalesce()
:
select coalesce(phen_id, producer_id, consumer_id) as phen_id,
min(record_time) as min_record_time,
max(record_time) as max_record_time
from phen_table
group by coalesce(phen_id, producer_id, consumer_id)
The logic of the two queries is not exactly the same. If there are rows where more than one of the three columns is not null
, and values differ, then the first query takes in account all non-null
values, while the second considers only the "first" non-null
value.
Edit
In Postgres, which you finally tagged, the union all
solution can be phrased more efficiently with a lateral join:
select x.phen_id,
min(p.record_time) as min_record_time,
max(p.record_time) as max_record_time
from phen_table p
cross join lateral (values (phen_id), (producer_id), (consumer_id)) as x(phen_id)
where x.phen_id is not null
group by x.phen_id