Search code examples
hiveapache-pig

Suggest most optimized way using hive or pig


Problem Statement Assume there is one text file of logs. Below are the fields in the file.

Log File

userID
productID
action

Where Action would be one of these –

Browse, Click, AddToCart, Purchase, LogOut

Select users who performed AddToCart action but did not perform Purchase action.

('1001','101','201','Browse'),
('1002','102','202','Click'),
('1001','101','201','AddToCart'),
('1001','101','201','Purchase'),
('1002','102','202','AddToCart')

Can anyone suggest to get this info using hive or pig with optimised performance


Solution

  • This is possible to do using sum() or analytical sum() depending on exact requirements in a single table scan. What if User added to cart two products, but purchased only one?

    For User+Product:

    select userID, productID 
      from
    (
    select 
           userID,
           productID,
           sum(case when action='AddToCart' then 1 else 0 end) addToCart_cnt,
           sum(case when action='Purchase' then 1 else 0 end)  Purchase_cnt
      from table
      group by userID, productID
    )s
    where addToCart_cnt>0 and Purchase_cnt=0