Search code examples
sqldatabasehiveqlimpala

HIVE/Impala query: Count the number of rows between rows fulfilling certain conditions


I need to count the number of rows that fulfill certain conditions contained in intervals defined by other rows that fulfill other conditions. Examples: the number of rows N between 'Reference' having values 1 and 4 that fulfill the condition 'Other_condition' = b is N=1, the number of rows N between 'Reference' having values 2 and 5 that fulfill the condition 'Other_condition' = b is N=2 etc.

  Date          Reference    Other_condition
20171111            1              a
20171112            2              a   
20171113            3              b
20171114            4              b  
20171115            5              b

I'm accessing the database through Hive/Impala SQL queries and unfortunately I have no idea where to start implementing such a window function. A half-pseudocode version of what I want would be something like:

SELECT COUNT (DISTINCT database.Date) AS counter, Other_condition, reference
FROM database
WHERE database.Other_condition = a AND database.Reference BETWEEN 
(window function condition 1: database.Reference = 2) AND 
(window function condition 2: database.Reference = 5)
GROUP BY counter

Solution

  • Your question is rather hard to follow. I get the first conditions, which is the number of rows between "1" and "4".

    Here is one method that should be pretty easy to generalize:

    select (max(case when reference = 4 then seqnum end) -
            max(case when reference = 1 then seqnum end)
           ) as num_rows_1_4
    from (select t.*,
                 row_number() over (order by date) as seqnum
          from t
         ) t;