Search code examples
mysqlsqlhivequery-optimizationhiveql

Querying for dates listed in another table


I want to select certain rows of a table that are between two dates (which are found in a seperate table). The details of my tables and query can be found in a previous question here (I am now interested in how to do this in HIVE/hiveQL). As my current query stands, it runs for a long time then seems to hang indefinitely, whereas when I hardcode in the dates it runs to completion fairly quickly. Tables and Query for reference:

VISIT_INFO, with these columns:

pers_key - unique identifyer for each person
pers_name - name of person
visit_date - date at which they visited a business

VALID_DATES, with these columns:

condition - string
start_date - date
end_date - date 

And the query itself:

select pers_key, pers_name from VISIT_INFO a
CROSS JOIN
(select start_date, end_date from VALID_DATES where condition = 'condition1') b
WHERE (a.visit_date >= b.start_date and a.visit_date <= b.end_date)
GROUP BY a.pers_key

Its worth noting Im using HIVE 0.12, so getting rid of the join and putting the select statement in the WHERE clause is out of the question. I'm wondering what exactly is wrong with this query, or what could be causing it to fail. Any suggestions as to how to improve this would be appreciated.


Solution

  • Try:

    select pers_key, pers_name 
    from VISIT_INFO a 
    join 
    valid_dates b
    WHERE a.visit_date BETWEEN b.start_date AND b.end_date
    GROUP BY pers_key, pers_name;
    

    As of Hive 0.13:

    select pers_key, pers_name 
    from VISIT_INFO a , valid_dates b
    WHERE a.visit_date BETWEEN b.start_date AND b.end_date
    GROUP BY pers_key, pers_name;