Search code examples
sqlhql

how to query multiple months in where clauses in hql?


SELECT *
FROM table_name
WHERE
    from_unixtime(end_time,'yyyyMMdd') >= '20211201'
and from_unixtime(start_time,'yyyyMMdd') <= '20211231'

I want to substitute '20211201' and '20211231' for every month between year 2020 to 2021, one way to do it is to substitute strings by hands and split it into multiple queries like this:

SELECT *
FROM table_name
WHERE
    from_unixtime(end_time,'yyyyMMdd') >= '20211201'
and from_unixtime(start_time,'yyyyMMdd') <= '20211231'
SELECT *
FROM table_name
WHERE
    from_unixtime(end_time,'yyyyMMdd') >= '20211101'
and from_unixtime(start_time,'yyyyMMdd') <= '20211130'

Is there any way I can put them in one query? I'm thinking a for loop or full join a column with

month    
20211031
20221130
20211231
...

but I don‘t know a elegant way to generate a series of months between two dates either..


Solution

  • WITH table_name AS (
     SELECT 0 user_id, unix_timestamp('2022-01-02 12:00:00') start_time, unix_timestamp('2022-01-04 12:00:00') end_time  UNION ALL
     SELECT 1 user_id, unix_timestamp('2022-01-04 12:00:00') start_time, unix_timestamp('2022-01-08 12:00:00') end_time  UNION ALL
     SELECT 2 user_id, unix_timestamp('2022-02-02 12:00:00') start_time, unix_timestamp('2022-04-04 12:00:00') end_time  UNION ALL
     SELECT 3 user_id, unix_timestamp('2022-02-04 12:00:00') start_time, unix_timestamp('2022-02-08 12:00:00') end_time  UNION ALL
     SELECT 3 user_id, unix_timestamp('2022-02-02 12:00:00') start_time, unix_timestamp('2022-02-04 12:00:00') end_time UNION ALL
     SELECT 4 user_id, unix_timestamp('2022-04-02 12:00:00') start_time, unix_timestamp('2022-05-05 12:00:00') end_time
    ),
    explodes AS (
      SELECT *, date_format(add_months(from_unixtime(t.start_time), diff), 'yyyyMM') month
        FROM table_name t
     LATERAL VIEW EXPLODE (
          -- this hack is due to lack of supporting *generate_array* functions and non-equi joins in Hive
          -- you can increase below map based on the period seen on your data
          map(0, array(0), 1, array(0, 1), 2, array(0, 1, 2), 3, array(0, 1, 2, 3), 4, array(0, 1, 2, 3, 4))[
              int(months_between(from_unixtime(t.end_time), from_unixtime(t.start_time)))
          ]) tf AS diff
    )
    SELECT month, COUNT(DISTINCT user_id) mau FROM explodes GROUP BY month;
    

    previous approach with more limitations

    Assuming that

    1. your data is big enough not to miss any months between period you're considering.
    2. you can set hive option below for non-equi join

    Cartesian products are disabled for safety reasons. If you know what you are doing, please sethive.strict.checks.cartesian.product to false and that hive.mapred.mode is not set to 'strict' to proceed.

    WITH table_name AS (
      -- same as above
    ),
    months AS (
      SELECT from_unixtime(start_time, 'yyyyMM') year_month FROM table_name
      UNION DISTINCT
      SELECT from_unixtime(end_time, 'yyyyMM') FROM table_name
    )
    SELECT m.year_month, COUNT(DISTINCT user_id) mau
      FROM months m LEFT JOIN table_name t ON m.year_month BETWEEN from_unixtime(t.start_date, 'yyyyMM') AND from_unixtime(t.end_date, 'yyyyMM')
     GROUP BY year_month;