Suppose we have a table named tracks as follows:
path  user_id  item_id
----------------------
view  A        I1
view  A        I3
buy   B        I2
view  C        I3
view  A        I4
buy   A        I1
view  B        I5
view  B        I4
We want to select the view history of every user who bought something, producing output like:
user_id  view_history   item_bought
-----------------------------------
A        [I1, I4, I3]   [I1]
B        [I4, I5]       [I2]
The order is unimportant; all we want is to track what a user viewed if they bought something. What I've tried (in Hive syntax):
WITH cus_bought AS (
    SELECT
        user_id
        ,COLLECT_LIST(item_id) item_bought
    FROM
        tracks
    WHERE
        path = 'buy'
    GROUP BY
        user_id
),
views AS (
    SELECT
        t1.user_id
        ,COLLECT_LIST(t1.item_id) event_data
    FROM
        tracks t1
    WHERE
        t1.path = 'view'
        AND EXISTS (
            SELECT NULL
            FROM
                tracks t2
            WHERE
                t2.user_id = t1.user_id
                AND t2.path = 'buy'
        )
    GROUP BY
        t1.user_id
)
SELECT c.user_id
    ,v.event_data
    ,c.item_bought
FROM
    cus_bought c
JOIN
    views v
    ON c.user_id = v.user_id
Is there another way to write this query? The table I'm working with is colossal, so the JOIN is very expensive. Is the self-JOIN avoidable? I wondered whether some sort of pivoting would help, but I had no success in that direction. Finally, the reason for the EXISTS clause in the second CTE is that I want the JOIN to handle as few rows as possible (in fact, I tried to get a 1-1 relation before joining), given the size of the table.
Any help would be appreciated.
Can you simply use collect_set()?
select t.user_id,
       collect_set(case when t.path = 'view' then t.item_id end) as views,
       collect_set(case when t.path = 'buy' then t.item_id end) as buys
from tracks t
group by t.user_id;
EDIT:
If you want only users that have a buy and don't want the other users to be part of the outer aggregation, you can filter them out first. I'm not sure if this is faster, but:
select t.user_id,
       collect_set(case when t.path = 'view' then t.item_id end) as views,
       collect_set(case when t.path = 'buy' then t.item_id end) as buys
from tracks t join
     (select distinct t2.user_id
      from tracks t2
      where t2.path = 'buy'
     ) b
     on t.user_id = b.user_id
group by t.user_id;
I don't know if the overhead of the distinct and join outweighs the overhead of the aggregation, but it is worth trying.
You can also use a window function:
select t.user_id,
       collect_set(case when t.path = 'view' then t.item_id end) as views,
       collect_set(case when t.path = 'buy' then t.item_id end) as buys
from (select t.*,
             sum(case when t.path = 'buy' then 1 else 0 end) over (partition by t.user_id) as num_buys
      from tracks t
     ) t
where num_buys > 0
group by t.user_id;
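One more variant that may be worth trying, offered as an untested sketch rather than a benchmarked recommendation: keep the single aggregation from the first query and drop the non-buyers with a HAVING clause. This reads tracks once and needs neither the join nor the window function, though it still aggregates rows for users who never bought before discarding them. It assumes the column names from the question (path, user_id, item_id).

```sql
-- Single pass over tracks: aggregate per user, then keep only groups
-- that contain at least one 'buy' row.
select t.user_id,
       -- case without else yields NULL, which collect_set skips
       collect_set(case when t.path = 'view' then t.item_id end) as views,
       collect_set(case when t.path = 'buy' then t.item_id end) as buys
from tracks t
group by t.user_id
having sum(case when t.path = 'buy' then 1 else 0 end) > 0;
```

On the sample data this should keep users A and B and drop C. Note that collect_set() removes duplicate items; if repeated views or purchases matter, collect_list() can be swapped in.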