
Batch Insert into Impala taking too much time


I am using the Impala JDBC driver to batch-insert data into Impala. I currently use a batch size of 1000 and execute the batches with an INSERT INTO ... VALUES clause via PreparedStatement. The Impala Daemon runs on 3 machines, and the Impala Catalog Server and Statestore run on a 4th machine.

The Impala query timeline for the batch insert looks like this:

Query Timeline: 35s398ms
   - Query submitted: 0.000ns (0.000ns)
   - Planning finished: 34s822ms (34s822ms)
   - Submit for admission: 34s886ms (63.997ms)
   - Completed admission: 34s886ms (0.000ns)
   - Ready to start 1 fragment instances: 34s886ms (0.000ns)
   - All 1 fragment instances started: 34s958ms (71.997ms)
   - DML data written: 35s082ms (123.996ms)
   - DML Metastore update finished: 35s286ms (203.993ms)
   - Request finished: 35s298ms (11.999ms)
   - Unregister query: 35s374ms (75.997ms)
 - ComputeScanRangeAssignmentTimer: 0.000ns

As we can see, the "Planning finished" step is taking almost all of the time. We have tried creating the table in both formats, PARQUET as well as plain text, but every time the "Planning finished" portion is far too high.

Is there any configuration change I need to do? Or am I doing something wrong?


Solution

  • The first thing to notice is that even though you are using a single PreparedStatement and executing it in batches, each row still gets its own INSERT statement. For example, the prepared statement

    INSERT INTO t VALUES (?, ?);
    

    will insert two rows as

    INSERT INTO t VALUES ('a', 1);
    INSERT INTO t VALUES ('b', 2);
    

    and not as

    INSERT INTO t VALUES ('a', 1), ('b', 2);
    

    The next thing to consider is that inserting rows one at a time into Hadoop is very inefficient, because a new HDFS file has to be created for each INSERT statement. A common mitigation, sketched at the end of this answer, is to pack many rows into a single multi-row VALUES clause.

    An excerpt from the Impala documentation:

    The INSERT ... VALUES technique is not suitable for loading large quantities of data into HDFS-based tables, because the insert operations cannot be parallelized, and each one produces a separate data file. Use it for setting up small dimension tables or tiny amounts of data for experimenting with SQL syntax, or with HBase tables. Do not use it for large ETL jobs or benchmark tests for load operations. Do not run scripts with thousands of INSERT ... VALUES statements that insert a single row each time. If you do run INSERT ... VALUES operations to load data into a staging table as one stage in an ETL pipeline, include multiple row values if possible within each VALUES clause, and use a separate database to make cleanup easier if the operation does produce many tiny files.
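
    Following the documentation's advice to include multiple row values within each VALUES clause, one option is to build a single prepared statement with many parameterized row tuples. Below is a minimal sketch of that idea in Java, assuming the two-column table t from the examples above and a hypothetical connection URL; the exact JDBC URL, driver class, and any per-statement parameter limits depend on your driver and setup.

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.PreparedStatement;

    public class MultiRowValuesInsert {
        public static void main(String[] args) throws Exception {
            // Hypothetical JDBC URL; replace with your Impala daemon host/port.
            String url = "jdbc:impala://impala-host:21050/default";
            int rows = 1000; // rows packed into one INSERT statement

            // Build "INSERT INTO t VALUES (?, ?), (?, ?), ..." so the whole
            // batch lands in one statement (and one data file), instead of
            // one INSERT (and one HDFS file) per row.
            StringBuilder sql = new StringBuilder("INSERT INTO t VALUES ");
            for (int i = 0; i < rows; i++) {
                sql.append(i == 0 ? "(?, ?)" : ", (?, ?)");
            }

            try (Connection conn = DriverManager.getConnection(url);
                 PreparedStatement ps = conn.prepareStatement(sql.toString())) {
                int p = 1;
                for (int i = 0; i < rows; i++) {
                    ps.setString(p++, "row-" + i); // first column, e.g. STRING
                    ps.setInt(p++, i);             // second column, e.g. INT
                }
                ps.executeUpdate(); // one round trip, one new data file
            }
        }
    }

    Even with multi-row VALUES, the documentation quoted above still discourages INSERT ... VALUES for large volumes. For bulk loads, writing files to HDFS and using LOAD DATA, or inserting from a staging table with INSERT ... SELECT, avoids the one-file-per-statement problem entirely.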