Search code examples
sqldatabasehivehiveqlhive-partitions

Sampling results with conditions in hive sql


I have a table that doesn't have a primary key and is partitioned by date; columns like this:

1. user_id  
2. device
3. region
4. datetime
5. and other columns

It contains user generated events from a website game, they trigger every second. I want to return a batch with all the events (including duplicate rows) generated by the first 6 users (top of the table) in the current day that check the conditions:

for region = US

- one user from iOS
- one user from android
- one user from PC

for region = EU

- one user from iOS
- one user from android
- one user from PC

Can you provide a sample code from where I should start? A friend of mine suggested something about RANK() but I never used it.

Thank you !

SELECT * FROM 
    (SELECT user_id, 
    event_post_time, 
    device, 
    region, 
    COUNT(DISTINCT player_id) over (partition by player_id) as ct_pid, 
    COUNT(DISTINCT region) over (partition by region) as ct_region, 
    COUNT(DISTINCT device) over (partition by device) as ct_device 
    FROM events 
    WHERE event_post_time = current_date() 
    AND region IN ('EU','US') 
    AND device IN ('ios','android','pc')) e 
WHERE ct_pid <= 6 
AND ct_region <= 2 
AND ct_device <= 3 
ORDER BY player_id

Adding dummy data at SQLFiddle and expected output:

user_id device region date_generated
  1  ios  EU  22-05-18
  1  ios  EU  22-05-18
  1  ios  EU  22-05-18
  4  ios  US  22-05-18
  4  ios  US  22-05-18
  2  android  EU  22-05-18
  2  android  US  22-05-18
  4  pc  EU  22-05-18
  4  pc  EU  22-05-18
  4  pc  EU  22-05-18
  5  pc  US  22-05-18

Solution

  • Probably, this is what you are looking for.

    select * from (
    select rank() over (partition by region,device order by cn desc) as
    top_num,player_id, region,device,cn from 
    (
    select count(*) as cn , player_id,region,device from 
    test_table group by player_id,region,device 
    )l
    )t 
    where top_num = 1;
    

    Let me know if this helps.

    OP EDIT: I managed to make it work for what I wanted using your provided query; here is the final one

    WITH combo 
     AS (SELECT user_id, 
                region, 
                device 
         FROM   (SELECT Rank() 
                          OVER ( 
                            partition BY region, device 
                            ORDER BY cn DESC) AS top_num, 
                        user_id, 
                        region, 
                        device, 
                        cn 
                 FROM   (SELECT Count(*) AS cn, 
                                user_id, 
                                region, 
                                device 
                         FROM   samples 
                         GROUP  BY user_id, 
                                   region, 
                                   device)l)t 
         WHERE  top_num = 1) 
    SELECT s.user_id, 
       s.region, 
       s.device 
    FROM   samples s 
       JOIN combo 
         ON s.user_id = combo.user_id 
            AND s.region = combo.region 
            AND s.device = combo.device