Tags: sql, hive, hiveql, hadoop-partitioning, hive-partitions

Hive: why use partition by in selects?


I cannot fully understand the partitioning concept in Hive. I understand what partitions are and how to create them. What I cannot get is why people write select statements with a "partition by" clause, as is done here: SQL most recent using row_number() over partition

SELECT user_id, page_name, recent_click
FROM (
  SELECT user_id,
         page_name,
         row_number() over (partition by session_id order by ts desc) as recent_click
  FROM clicks_data
) T
WHERE recent_click = 1

Why specify the partition key in selects? The partition key was already defined during table creation, and the select statement will use the partition scheme defined in the CREATE TABLE statement. So why add that over (partition by session_id order by ts desc)? What happens if I skip over (partition by session_id order by ts desc)?


Solution

  • Read about Hive Windowing and Analytics Functions.

    row_number() is an analytics function which numbers rows and requires over().

    In the over() you can specify the group (partition) for which the function is calculated. partition by in the over() is not the same as partitioned by in the CREATE TABLE DDL; they have nothing in common. In CREATE TABLE it determines how the data is stored (each partition is a separate folder in Hive), and a partitioned table is used to optimize filtering or loading of data.
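
    For contrast, here is a minimal sketch of table-level partitioning. This DDL and the dt column are illustrative assumptions, not part of the question:

    create table clicks_data (
        user_id    int,
        session_id int,
        page_name  string,
        ts         string
    )
    partitioned by (dt string); -- each dt value is stored as a separate HDFS folder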

    partition by in the over() determines the group in which the function is calculated. It is similar to GROUP BY in a select, with the difference that an analytics function does not change the number of rows, as the contrast below shows.
    row_number() re-initializes when it crosses a partition boundary and restarts from 1.
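
    For example, compare GROUP BY aggregation with its analytic equivalent (a sketch against the clicks_data test data defined further below):

    -- GROUP BY collapses rows: 3 rows, one per session
    select session_id, count(*) as clicks
    from clicks_data
    group by session_id;

    -- analytic count keeps all 9 rows and repeats the value within each session
    select session_id, page_name,
           count(*) over (partition by session_id) as clicks
    from clicks_data;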

    row_number() also needs order by in the over(). order by determines the order in which rows will be numbered.

    If you do not specify partition by, row_number() will work on the whole dataset as a single partition. It will produce a single 1, and the maximum number will equal the number of rows in the whole dataset. Table partitioning does not affect analytics function behavior.

    If you do not specify order by, then row_number() will number rows in a non-deterministic order, and probably different rows will be marked 1 from run to run. This is why you need to specify order by. In your example, order by ts desc means that 1 will be assigned to the row with the maximum ts (for each session_id).

    Say there are three different session_ids and three clicks in each session, each with a different ts (9 rows in total). Then row_number in your example will assign 1 to the last click of each session, and after filtering on recent_click = 1 you will get 3 rows instead of the initial 9. row_number() over() without partition by will number all rows from 1 to 9 in a random order (which may differ from run to run), and the same filter will leave you a single row, picked from one of the 3 sessions at random.
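
    The difference is easy to check with a sketch (using the same clicks_data test data defined below):

    -- with partition by: 3 rows, the most recent click per session
    select * from (
        select user_id, session_id, page_name, ts,
               row_number() over (partition by session_id order by ts desc) as rn
        from clicks_data
    ) t
    where rn = 1;

    -- without partition by: a single row, picked non-deterministically
    select * from (
        select user_id, session_id, page_name, ts,
               row_number() over () as rn
        from clicks_data
    ) t
    where rn = 1;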

    See also this answer https://stackoverflow.com/a/55909947/2700344 for more details on how it works in Hive; there is also a similar question about table partitioning vs over() in the comments.

    Try this example; it may be better than reading a too-long explanation:

    with clicks_data as (
        select stack (9,
            --session 1
            1, 1, 'page1', '2020-01-01 01:01:01.123',
            1, 1, 'page1', '2020-01-01 01:01:01.124',
            1, 1, 'page2', '2020-01-01 01:01:01.125',
            --session 2
            1, 2, 'page1', '2020-01-01 01:02:02.123',
            1, 2, 'page2', '2020-01-01 01:02:02.124',
            1, 2, 'page1', '2020-01-01 01:02:02.125',
            --session 3
            1, 3, 'page1', '2020-01-01 01:03:01.123',
            1, 3, 'page2', '2020-01-01 01:03:01.124',
            1, 3, 'page1', '2020-01-01 01:03:01.125'
        ) as (user_id, session_id, page_name, ts)
    )
    SELECT user_id,
           session_id,
           page_name,
           ts,
           ROW_NUMBER() OVER (PARTITION BY session_id ORDER BY ts DESC) AS rn1,
           ROW_NUMBER() OVER () AS rn2
    FROM clicks_data

    Result:

    user_id  session_id  page_name  ts                       rn1  rn2
    1        2           page1      2020-01-01 01:02:02.125  1    1
    1        2           page2      2020-01-01 01:02:02.124  2    2
    1        2           page1      2020-01-01 01:02:02.123  3    3
    1        1           page2      2020-01-01 01:01:01.125  1    4
    1        1           page1      2020-01-01 01:01:01.124  2    5
    1        1           page1      2020-01-01 01:01:01.123  3    6
    1        3           page1      2020-01-01 01:03:01.125  1    7
    1        3           page2      2020-01-01 01:03:01.124  2    8
    1        3           page1      2020-01-01 01:03:01.123  3    9

    The first row_number assigned 1 to the row with the max timestamp in each session (partition). The second row_number, with no partition or order specified, numbered all rows from 1 to 9. Why does rn2=1 land on session 2, on the row with the max timestamp in that session — shouldn't it be random? Because to calculate the first row_number, all rows were distributed by session_id and ordered by timestamp descending, and it happened that the second row_number received session 2 first (its file was read by the reducer before the other two files prepared by the mapper); since the rows were already sorted for the calculation of rn1, rn2 received them in the same order. Without row_number1, it could look "more random". The bigger the dataset, the more random the rn2 order will look.
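
    If you need a global numbering that is stable from run to run, keep order by and drop partition by (a hedged variation on the example above, not part of the original question):

    SELECT user_id,
           session_id,
           page_name,
           ts,
           ROW_NUMBER() OVER (ORDER BY ts DESC) AS rn -- numbers all 9 rows 1..9 by timestamp, deterministically
    FROM clicks_data
    -- ties in ts, if there were any, would still be ordered arbitrarily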