I want to insert a file with 864k rows (~100MB) into a ClickHouse database. I know about inserting a file with ClickHouse Client or importing straight from ClickHouse Cloud - both of these methods take only about 1.5s to insert this file. My overall goal is to insert a lot of CSV files into the database as fast as possible.
I've tried to insert the same file from Java in different ways:
public void insertData(List<String> data) {
    try (PreparedStatement statement =
            connection.prepareStatement(
                "INSERT INTO btc_data SELECT * FROM input('col1 DateTime, col2 Float32, col3 Float32, "
                    + "col4 Float32, col5 Float32, col6 Decimal(38,2), "
                    + "col7 DateTime, col8 Float32, col9 Int32, col10 Decimal(38,2), "
                    + "col11 Float32, col12 Int32')")) {
        batchInsertData(data, statement);
        statement.executeBatch();
    } catch (SQLException e) {
        throw new RuntimeException(e);
    }
}
private void batchInsertData(List<String> data, PreparedStatement statement) {
    try {
        for (String str : data) {
            String[] values = str.split(",");
            LocalDateTime open_time =
                LocalDateTime.ofInstant(
                    Instant.ofEpochMilli(Long.parseLong(values[0])), ZoneOffset.UTC);
            LocalDateTime close_time =
                LocalDateTime.ofInstant(
                    Instant.ofEpochMilli(Long.parseLong(values[6])), ZoneOffset.UTC);
            statement.setObject(1, open_time);
            statement.setFloat(2, Float.parseFloat(values[1]));
            statement.setFloat(3, Float.parseFloat(values[2]));
            statement.setFloat(4, Float.parseFloat(values[3]));
            statement.setFloat(5, Float.parseFloat(values[4]));
            statement.setBigDecimal(6, new BigDecimal(values[5]));
            statement.setObject(7, close_time);
            statement.setFloat(8, Float.parseFloat(values[7]));
            statement.setInt(9, Integer.parseInt(values[8]));
            statement.setBigDecimal(10, new BigDecimal(values[9]));
            statement.setFloat(11, Float.parseFloat(values[10]));
            statement.setInt(12, Integer.parseInt(values[11]));
            statement.addBatch();
        }
    } catch (SQLException e) {
        throw new RuntimeException(e);
    }
}
This one takes about 15 seconds - roughly 10x slower than the native methods at fully inserting the 100MB file.
public void insertFromFile() {
    try (ClickHouseClient client = ClickHouseClient.newInstance(server.getProtocol())) {
        ClickHouseFile file =
            ClickHouseFile.of(
                "src/main/resources/864400.csv", ClickHouseCompression.NONE, ClickHouseFormat.CSV);
        ClickHouseResponse response =
            client
                .write(server)
                .set("format_csv_delimiter", ",")
                .table("btc_data")
                .data(file)
                .executeAndWait();
        ClickHouseResponseSummary summary = response.getSummary();
        System.out.println(summary.getWrittenRows());
        response.close();
    } catch (ClickHouseException e) {
        throw new RuntimeException(e);
    }
}
This method takes about 28 seconds - roughly 20x slower than the native methods!
My overall question is: how can I speed up inserting from Java so it is as fast as the native ClickHouse Client or Cloud methods - maybe twice as slow is acceptable, but not 10x slower?
I've also tried the R2DBC driver - too slow.
Full disclosure: I work for ClickHouse and am not an expert with the Java clients. However, I discussed this with the Java dev team at ClickHouse, and they made a couple of suggestions:
You may want to instrument some logging with timestamps to determine how much time is spent on Java bootstrapping (JVM startup, connection setup) versus the actual insertion. This would give a fairer comparison against clickhouse-client.
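A minimal sketch of that instrumentation: a small helper that times an arbitrary task with `System.nanoTime()`, so you can wrap the connection setup and the insert call separately and compare the two numbers. The `InsertTimer` class name and the commented-out `openConnection()`/`insertData()` calls are placeholders, not part of the ClickHouse API.

```java
import java.util.concurrent.TimeUnit;

public class InsertTimer {

    /** Runs the given task and returns its wall-clock duration in milliseconds. */
    public static long timeMillis(Runnable task) {
        long start = System.nanoTime();
        task.run();
        return TimeUnit.NANOSECONDS.toMillis(System.nanoTime() - start);
    }

    public static void main(String[] args) {
        // Time bootstrapping and insertion separately, e.g. (hypothetical methods):
        // long setupMs  = timeMillis(() -> openConnection());
        // long insertMs = timeMillis(() -> insertData(rows));
        long demoMs = timeMillis(() -> { /* placeholder for the actual insert */ });
        System.out.println("elapsed ms: " + demoMs);
    }
}
```

If most of the 15 seconds turns out to be connection or JVM startup overhead rather than insert time, the comparison with the 1.5s clickhouse-client number looks very different.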
Adjusting the socket buffers to a larger size can also improve performance:
prop.setProperty("buffer_size", "8192");
prop.setProperty("socket_rcvbuf", "804800");
prop.setProperty("socket_sndbuf", "804800");
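For context, those properties are set on the `java.util.Properties` object passed when opening the JDBC connection. A sketch, where the `jdbc:clickhouse://localhost:8123/default` URL is an assumed example - substitute your own host, port, and database:

```java
import java.util.Properties;

public class ClickHouseConnProps {

    /** Builds connection properties with the enlarged socket buffers suggested above. */
    public static Properties buffered() {
        Properties prop = new Properties();
        prop.setProperty("buffer_size", "8192");
        prop.setProperty("socket_rcvbuf", "804800");
        prop.setProperty("socket_sndbuf", "804800");
        return prop;
    }

    public static void main(String[] args) {
        Properties prop = buffered();
        // Pass the properties when connecting (example URL, adjust to your setup):
        // Connection conn =
        //     DriverManager.getConnection("jdbc:clickhouse://localhost:8123/default", prop);
        System.out.println(prop.getProperty("socket_sndbuf"));
    }
}
```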
Hope this helps - please let me know.