While working with batch insertion in jOOQ (v3.14.4) I noticed some inconsistency when looking into PostgreSQL (v12.6) logs.
When doing context.batch(<query>).bind(<1st record>).bind(<2nd record>)...bind(<nth record>).execute()
the logs show that the records are actually inserted one by one instead of all in one go.
While doing context.insert(<fields>).values(<1st record>).values(<2nd record>)...values(<nth record>)
actually inserts everything in one go judging by the postgres logs.
Is it a bug in the jOOQ itself or was I using the batch(...)
functionality incorrectly?
Here are 2 code snippets that are supposed to do the same but in reality, the first one inserts records one by one while the second one actually does the batch insertion.
public void batchInsertEdges(List<EdgesRecord> edges) {
Query batchQuery = context.insertInto(Edges.EDGES,
Edges.EDGES.SOURCE_ID, Edges.EDGES.TARGET_ID, Edges.EDGES.CALL_SITES,
Edges.EDGES.METADATA)
.values((Long) null, (Long) null, (CallSiteRecord[]) null, (JSONB) null)
.onConflictOnConstraint(Keys.UNIQUE_SOURCE_TARGET).doUpdate()
.set(Edges.EDGES.CALL_SITES, Edges.EDGES.as("excluded").CALL_SITES)
.set(Edges.EDGES.METADATA, field("coalesce(edges.metadata, '{}'::jsonb) || excluded.metadata", JSONB.class));
var batchBind = context.batch(batchQuery);
for (var edge : edges) {
batchBind = batchBind.bind(edge.getSourceId(), edge.getTargetId(),
edge.getCallSites(), edge.getMetadata());
}
batchBind.execute();
}
public void batchInsertEdges(List<EdgesRecord> edges) {
var insert = context.insertInto(Edges.EDGES,
Edges.EDGES.SOURCE_ID, Edges.EDGES.TARGET_ID, Edges.EDGES.CALL_SITES, Edges.EDGES.METADATA);
for (var edge : edges) {
insert = insert.values(edge.getSourceId(), edge.getTargetId(), edge.getCallSites(), edge.getMetadata());
}
insert.onConflictOnConstraint(Keys.UNIQUE_SOURCE_TARGET).doUpdate()
.set(Edges.EDGES.CALL_SITES, Edges.EDGES.as("excluded").CALL_SITES)
.set(Edges.EDGES.METADATA, field("coalesce(edges.metadata, '{}'::jsonb) || excluded.metadata", JSONB.class))
.execute();
}
I would appreciate some help to figure out why the first code snippet does not work as intended and second one does. Thank you!
There's a difference between "batch processing" (as in JDBC batch) and "bulk processing" (as in what many RDBMS call "bulk updates").
This page of the manual about data import explains the difference.
- Bulk size: The number of rows that are sent to the server in one SQL statement.
- Batch size: The number of statements that are sent to the server in one JDBC statement batch.
These are fundamentally different things. Both help improve performance. Bulk data processing does so by helping the RDBMS optimise resource allocation algorithms as it knows it is about to insert 10 records. Batch data processing does so by reducing the number of round trips between client and server. Whether either approach has a big impact on any given RDBMS is obviously vendor specific.
In other words, both of your approaches work as intended.