Search code examples
sqlamazon-athenaprestotrino

Discrepancy between two similar presto SQL queries


I have the following SQL queries that should written similar results wrt offset column, but dont.

Query 1:

SELECT visitor_id, array_agg(timestamp) as time, array_agg(offset) as offset_list 
from
(SELECT * FROM 
(
SELECT visitor_id, timestamp,
cast(json_extract(uri_args, '$.offset') AS int) as offset


FROM table_t 
where year = 2023 and month = 1 and day = 27 and 

request_uri = '/home_page')
order by visitor_id, timestamp)
group by visitor_id 
order by cardinality(offset_list) desc 

Query 2:

SELECT visitor_id ,array_agg(offset) as offset_list 
from
(SELECT * FROM 
(
SELECT visitor_id, timestamp,
cast(json_extract(uri_args, '$.offset') AS int) as offset


FROM table_t 
where year = 2023 and month = 1 and day = 27 and 

request_uri = '/home_page')
order by visitor_id, timestamp)
group by visitor_id 
order by cardinality(offset_list) desc

Here uri_args is simply a json file which under the key 'offset' contains the value of the offset for the particular API response. This is from response log table of a server.

Although the two queries are similar and according to me ought to return the same values in the offset_list column i find the following discrepancy :

I will consider a particular visitor_id to convey it clearly, for a visitor_id ='12345' query i returns the following row in the offset_list col

[0, 0, 0, 10, 0, 10, 20, 32, 42, 0, 0, 20, 53, 77, 57, 0, 10, 20, 31, 10, 41, 0, 10, 41, 54, 77, 0, 10, 31, 41, 54, 10, 31, 54, 57, 77, 10, 20, 32, 0, 10, 21, 33, 44, 72, 52, 0, 10, 20, 31, 41]

and for query 2 the output is as follows :

[20, 32, 42, 0, 0, 20, 53, 77, 57, 0, 10, 20, 31, 10, 41, 0, 10, 41, 54, 77, 0, 10, 31, 41, 54, 10, 31, 54, 77, 57, 10, 20, 32, 0, 10, 21, 33, 44, 72, 52, 0, 10, 20, 31, 41, 0, 0, 0, 10, 0, 10]

I can observe the the two are circular permutations of each other but fail to see why this is happening. Please help me understand what the difference is the inner working of each query. The first reply suits the intent of my quest which is to capture the visitors journey on the homepage.


Solution

  • If you need your array elements deterministically ordered specify order by clause for aggregate function as mentioned in the docs

    Some aggregate functions such as array_agg() produce different results depending on the order of input values. This ordering can be specified by writing an ORDER BY Clause within the aggregate function

    SELECT visitor_id ,array_agg(offset order by timestamp) as offset_list 
    -- ...