Select every row related with a field in a table

suppose we have a table named tracks as follows:

path   user_id   item_id
------------------------
view   A         I1
view   A         I3
buy    B         I2
view   C         I3
view   A         I4
buy    A         I1
view   B         I5
view   B         I4

and we want to select the user view history, for those users who bought something, resulting in an output like

user_id  view_history   item_bought
-----------------------------------
A        [I1, I4, I3]   [I1]
B        [I4, I5]       [I2]

Here the order is unimportant, all we want is to track what the user saw if he/she bought something. What I've tried (using some HIVE notation):

    WITH cus_bought AS(
    SELECT 
        user_id
        ,COLLECT_LIST(item_id) item_bought
    FROM
        tracks
    WHERE
        path = 'buy'
    GROUP BY
        user_id
),
views AS(
    SELECT 
         t1.user_id
         ,COLLECT_LIST(item_id) event_data
    FROM 
        tracks t1
    WHERE
        path = 'view'
        AND EXISTS (
                    SELECT NULL
                    FROM 
                        tracks t2
                    WHERE
                        t2.user_id = t1.user_id
                   )
    GROUP BY 
        t1.user_id
)
SELECT c.user_id
       ,v.event_data
       ,c.item_bought
FROM
    cus_bought c
JOIN
    views v
ON c.user_id = v.user_id

Is there another way to perform this query? The table I'm facing is colossal, so the JOIN is very expensive. Is the self-JOIN avoidable? I was wondering if some sort of pivoting would help me, but not success in that direction. Finally, the reason for the EXISTS clause in the second CTE is that I want the JOIN to handle as few rows as possible (in fact, I tried to get a 1-1 relation before joining), given the size of the table.

Any help would be appreciated.

Solution

Can you simply use collect_set()?

select user,
       collect_set(case when path = 'view' then item_id) as views,
       collect_set(case when path = 'buy' then item_id) as buys
from tracks t
group by user;

EDIT:

If you want only users that have buy and don't want them part of the outer aggregation, you can filter them out. I'm not sure if this is faster but:

select user,
       collect_set(case when t.path = 'view' then t.item_id) as views,
       collect_set(case when t.path = 'buy' then t.item_id) as buys
from tracks t join
     (select distinct t2.user
      from tracks t2
      where t2.path = 'buy'
     ) t2
     on t.user = t2.user
group by user;

I don't know if the overhead of the distinct and join outweighs the overhead of the aggregation, but it is worth trying.

You can also use a window function:

select user,
       collect_set(case when t.path = 'view' then t.item_id) as views,
       collect_set(case when t.path = 'buy' then t.item_id) as buys
from (select t.*,
            sum(case when t.path = 'buy' then 1 else 0 end) over (partition by user) as num_buys
      from tracks t
     ) t
where num_buys > 0
group by user;