Problem Statement Assume there is one text file of logs. Below are the fields in the file.
Log File
userID
productID
action
Where Action would be one of these –
Browse, Click, AddToCart, Purchase, LogOut
Select users who performed AddToCart action but did not perform Purchase action.
('1001','101','201','Browse'),
('1002','102','202','Click'),
('1001','101','201','AddToCart'),
('1001','101','201','Purchase'),
('1002','102','202','AddToCart')
Can anyone suggest to get this info using hive or pig with optimised performance
This is possible to do using sum() or analytical sum() depending on exact requirements in a single table scan. What if User added to cart two products, but purchased only one?
For User+Product:
select userID, productID
from
(
select
userID,
productID,
sum(case when action='AddToCart' then 1 else 0 end) addToCart_cnt,
sum(case when action='Purchase' then 1 else 0 end) Purchase_cnt
from table
group by userID, productID
)s
where addToCart_cnt>0 and Purchase_cnt=0