I am using the Impala JDBC driver to batch insert data into Impala. I am currently using a batch size of 1000 and the INSERT INTO ... VALUES clause with a PreparedStatement to execute batched queries. The Impala Daemon is running on 3 machines, and the Impala Catalog Server and State Store are running on a 4th machine.
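For reference, the insert logic looks roughly like the following sketch (the connection URL, table, and column layout are simplified placeholders, not the real ones):

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.util.List;

public class ImpalaBatchInsert {
    private static final int BATCH_SIZE = 1000;

    // Batched insert loop; table and column types are placeholders.
    public static void insertRows(List<Object[]> rows) throws Exception {
        try (Connection conn = DriverManager.getConnection(
                 "jdbc:impala://impala-host:21050/default");   // placeholder URL
             PreparedStatement ps = conn.prepareStatement(
                 "INSERT INTO my_table VALUES (?, ?)")) {      // placeholder table
            int count = 0;
            for (Object[] row : rows) {
                ps.setString(1, (String) row[0]);
                ps.setInt(2, (Integer) row[1]);
                ps.addBatch();
                if (++count % BATCH_SIZE == 0) {
                    ps.executeBatch();   // send a batch of 1000 rows
                }
            }
            ps.executeBatch();           // send any remaining rows
        }
    }
}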
The batch insert query plan on Impala looks like this:
Query Timeline: 35s398ms
- Query submitted: 0.000ns (0.000ns)
- Planning finished: 34s822ms (34s822ms)
- Submit for admission: 34s886ms (63.997ms)
- Completed admission: 34s886ms (0.000ns)
- Ready to start 1 fragment instances: 34s886ms (0.000ns)
- All 1 fragment instances started: 34s958ms (71.997ms)
- DML data written: 35s082ms (123.996ms)
- DML Metastore update finished: 35s286ms (203.993ms)
- Request finished: 35s298ms (11.999ms)
- Unregister query: 35s374ms (75.997ms)
- ComputeScanRangeAssignmentTimer: 0.000ns
As we can see, the Planning finished step is taking almost all of the time. We have tried creating the table in both formats, PARQUET as well as normal, but every time the Planning finished phase takes too long.
Is there any configuration change I need to make, or am I doing something wrong?
The first thing to notice is that even though you are using a single PreparedStatement with batching, each row still gets its own INSERT statement. For example, the prepared statement
INSERT INTO t VALUES (?, ?);
executed with a batch of two rows will insert them as
INSERT INTO t VALUES ('a', 1);
INSERT INTO t VALUES ('b', 2);
and not as
INSERT INTO t VALUES ('a', 1), ('b', 2);
The next thing to consider is that inserting a single row in Hadoop is very inefficient as a new HDFS file has to be created for each row.
An excerpt from the Impala documentation:
The INSERT ... VALUES technique is not suitable for loading large quantities of data into HDFS-based tables, because the insert operations cannot be parallelized, and each one produces a separate data file. Use it for setting up small dimension tables or tiny amounts of data for experimenting with SQL syntax, or with HBase tables. Do not use it for large ETL jobs or benchmark tests for load operations. Do not run scripts with thousands of INSERT ... VALUES statements that insert a single row each time. If you do run INSERT ... VALUES operations to load data into a staging table as one stage in an ETL pipeline, include multiple row values if possible within each VALUES clause, and use a separate database to make cleanup easier if the operation does produce many tiny files.
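If you still need to push data through JDBC, the excerpt above suggests including multiple row values in each VALUES clause rather than relying on driver-level batching. A rough sketch of that idea follows; the table name, column layout, and chunk size are illustrative placeholders, not something taken from your setup:

import java.sql.Connection;
import java.sql.PreparedStatement;
import java.util.List;

public class MultiRowInsert {
    // Inserts rows in chunks, using a single INSERT ... VALUES statement
    // with multiple row groups per round trip instead of one statement per row.
    public static void insertInChunks(Connection conn, List<Object[]> rows, int chunkSize) throws Exception {
        for (int start = 0; start < rows.size(); start += chunkSize) {
            int end = Math.min(start + chunkSize, rows.size());
            // Build "INSERT INTO my_table VALUES (?, ?), (?, ?), ..." for this chunk.
            StringBuilder sql = new StringBuilder("INSERT INTO my_table VALUES ");
            for (int i = start; i < end; i++) {
                sql.append("(?, ?)");
                if (i < end - 1) {
                    sql.append(", ");
                }
            }
            try (PreparedStatement ps = conn.prepareStatement(sql.toString())) {
                int param = 1;
                for (int i = start; i < end; i++) {
                    ps.setString(param++, (String) rows.get(i)[0]);
                    ps.setInt(param++, (Integer) rows.get(i)[1]);
                }
                ps.executeUpdate();   // one statement, and one data file, per chunk
            }
        }
    }
}

Keep in mind that each such statement still produces a separate data file, so for genuinely large loads the documentation generally steers you toward writing files to HDFS and loading them with LOAD DATA or INSERT ... SELECT from a staging table instead.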