Search code examples
javaspring-boothive

Retrieve million records in batches from Apache Hive - Java 8 + Spring Boot + Hive 1.2.1 version


I need to retrieve 10 million rows from Hive.

String sql = select * from table_name

List<Map<String, Object>> resultSet = jdbctemplate.queryForList(String sql)

The above method runs well to retrieve 1 million rows at once(single hit) with 2GB of Heap Memory. It takes 3-4 minutes only to select records from a table size of 30 MB(1 million rows).

But for more than 1 million records, there are memory issues, and takes more time.

I need to query Hive with OFFSET values, but for the 1.2.1 version, there is no OFFSET clause it seems.

Is there any other way to select records from Hive as Batch? Select the first 10K records and the next 10K like that?


Solution

  • If you have some Primary Key candidate (can be a list of columns) which can be used in order by then you can use row_number():

    select --column list here
    from (
        select t.*, row_number() OVER (ORDER by PK) as rn --use PK in order by
        from table_name t
        ) s
    where rn between 1000001 and 2000000
    

    Just check that your PK candidate is unique and not null, because if PK is not unique or can be null then row_number may have non-deterministic behavior and may produce different results from run to run.

    And If you do not have PK, this functionality can not be implemented because Hive may return differently ordered rows due to parallel execution, this will result in duplication of rows across batches, some rows can be lost.