Search code examples
sqlperformancehivehiveql

Select every row related with a field in a table


suppose we have a table named tracks as follows:

path   user_id   item_id
------------------------
view   A         I1
view   A         I3
buy    B         I2
view   C         I3
view   A         I4
buy    A         I1
view   B         I5
view   B         I4

and we want to select the user view history, for those users who bought something, resulting in an output like

user_id  view_history   item_bought
-----------------------------------
A        [I1, I4, I3]   [I1]
B        [I4, I5]       [I2]

Here the order is unimportant, all we want is to track what the user saw if he/she bought something. What I've tried (using some HIVE notation):

    WITH cus_bought AS(
    SELECT 
        user_id
        ,COLLECT_LIST(item_id) item_bought
    FROM
        tracks
    WHERE
        path = 'buy'
    GROUP BY
        user_id
),
views AS(
    SELECT 
         t1.user_id
         ,COLLECT_LIST(item_id) event_data
    FROM 
        tracks t1
    WHERE
        path = 'view'
        AND EXISTS (
                    SELECT NULL
                    FROM 
                        tracks t2
                    WHERE
                        t2.user_id = t1.user_id
                   )
    GROUP BY 
        t1.user_id
)
SELECT c.user_id
       ,v.event_data
       ,c.item_bought
FROM
    cus_bought c
JOIN
    views v
ON c.user_id = v.user_id

Is there another way to perform this query? The table I'm facing is colossal, so the JOIN is very expensive. Is the self-JOIN avoidable? I was wondering if some sort of pivoting would help me, but not success in that direction. Finally, the reason for the EXISTS clause in the second CTE is that I want the JOIN to handle as few rows as possible (in fact, I tried to get a 1-1 relation before joining), given the size of the table.

Any help would be appreciated.


Solution

  • Can you simply use collect_set()?

    select user,
           collect_set(case when path = 'view' then item_id) as views,
           collect_set(case when path = 'buy' then item_id) as buys
    from tracks t
    group by user;
    

    EDIT:

    If you want only users that have buy and don't want them part of the outer aggregation, you can filter them out. I'm not sure if this is faster but:

    select user,
           collect_set(case when t.path = 'view' then t.item_id) as views,
           collect_set(case when t.path = 'buy' then t.item_id) as buys
    from tracks t join
         (select distinct t2.user
          from tracks t2
          where t2.path = 'buy'
         ) t2
         on t.user = t2.user
    group by user;
    

    I don't know if the overhead of the distinct and join outweighs the overhead of the aggregation, but it is worth trying.

    You can also use a window function:

    select user,
           collect_set(case when t.path = 'view' then t.item_id) as views,
           collect_set(case when t.path = 'buy' then t.item_id) as buys
    from (select t.*,
                sum(case when t.path = 'buy' then 1 else 0 end) over (partition by user) as num_buys
          from tracks t
         ) t
    where num_buys > 0
    group by user;