Tags: java, jdbc, batch-file, jta

Pattern for batch query crawler operations


I am trying to create an abstraction for a batch query crawler operation. The idea is that a query is executed, a result set is obtained, and for each row an operation is performed that either commits or rolls back. The requirement is that all rows are processed regardless of any failures, and that the result set is not loaded into memory beforehand.

The problem boils down to the fact that it is not possible to keep a result set open after a rollback. This is per the JDBC spec: cursor holdability can be maintained across a commit (using ResultSet.HOLD_CURSORS_OVER_COMMIT), but not across a rollback.
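
For reference, holdability is requested per statement when the statement is created. A minimal sketch, assuming a plain JDBC DataSource (the dataSource reference and the query text are placeholders, and not every driver honors this holdability setting):

Connection conn = dataSource.getConnection(); // dataSource is a placeholder
Statement stmt = conn.createStatement(
        ResultSet.TYPE_FORWARD_ONLY,
        ResultSet.CONCUR_READ_ONLY,
        ResultSet.HOLD_CURSORS_OVER_COMMIT); // cursor stays open across commit, but not across rollback
ResultSet rs = stmt.executeQuery("SELECT ..."); // query text elided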

A naive implementation with JTA/JDBC semantics, providing two extension points (one for specifying the query and one for specifying the actual operation for each row), would look something like this:

UserTransaction tx = getUserTransaction();
tx.begin();
ResultSet rs = executeQuery(); // extension point
tx.commit();
while (rs.next()) {
    tx.begin();
    try {
        performOperationOnCurrentRow(rs); // extension point
        tx.commit();
        logSuccess();
    } catch (Exception e) {
        tx.rollback();
        logFailure(e);
    }
}

This does not seem like such a far-fetched scenario, yet I've found very little relevant information on the web. The question is: has this been addressed elegantly by any of the popular frameworks? I do not necessarily need an out-of-the-box solution; I just wonder whether there is a known, generally accepted approach to this scenario.

One solution would be to keep track of the row that failed and re-open a cursor after that point, which generally requires enforcing some rules on the extension points (e.g. an ordered result set, a query that filters on the last failed row id in its WHERE clause, etc.).
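
A rough sketch of this first approach, assuming the rows are ordered by a numeric id and the query can be restricted to ids greater than the last one handled. executeQueryAfter() is an illustrative helper, and exception handling is kept as loose as in the snippet above:

UserTransaction tx = getUserTransaction();
long lastHandledId = 0;
boolean reopen = true;
while (reopen) {
    reopen = false;
    tx.begin();
    ResultSet rs = executeQueryAfter(lastHandledId); // e.g. "... WHERE id > ? ORDER BY id"
    tx.commit();
    while (rs.next()) {
        long id = rs.getLong("id");
        tx.begin();
        try {
            performOperationOnCurrentRow(rs); // extension point
            tx.commit();
            logSuccess();
            lastHandledId = id;
        } catch (Exception e) {
            tx.rollback();
            logFailure(e);
            lastHandledId = id; // do not retry the failed row
            reopen = true;      // the rollback closed the cursor, re-query past this row
            break;
        }
    }
}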

Another would be to use two separate threads: one for the query and one for the row operations.

What would you do?


Solution

  • Since this has not been answered for a couple of years now, I will go ahead and answer it myself.

    The solution we worked out revolves around the idea that the process (called BatchProcess) executes a Query (not limited to SQL, mind you) and adds its results to a concurrent queue. The BatchProcess spawns a number of QueueProcessor objects (each run on its own thread) that consume entries from the queue and execute an Operation that takes the entry as input. Each Operation execution is performed as a single unit of work. The underlying transaction manager is a JTA implementation. A rough sketch of this shape is included at the end of this answer.

    It is a bit dated, but at some point the implementation could be found here: https://bo2.googlecode.com/svn/trunk/Bo2ImplOpen/main/gr/interamerican/bo2/impl/open/runtime/concurrent/BatchProcess.java

    There is even a GUI for BatchProcess monitoring and management somewhere in that repo ;)

    Disclaimer: This guy had A LOT to do with this design and implementation https://stackoverflow.com/users/2362356/nakosspy

    (UML diagrams for the BatchProcess objects)
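
    For concreteness, here is a minimal sketch of that producer/consumer shape using plain JDK concurrency types plus the JTA UserTransaction. The names (Row, performOperation, processorCount) are illustrative, executeQuery() is assumed here to return an iterable of rows rather than a ResultSet, and the busy-wait termination is simplified; this is not the actual Bo2 API:

        BlockingQueue<Row> queue = new LinkedBlockingQueue<>();
        AtomicBoolean queryDone = new AtomicBoolean(false);

        // QueueProcessor: consumes queue entries, each one in its own transaction
        Runnable queueProcessor = () -> {
            while (true) {
                Row row = queue.poll();
                if (row == null) {
                    if (queryDone.get()) break; // queue drained and query finished
                    Thread.yield();
                    continue;
                }
                UserTransaction tx = getUserTransaction();
                try {
                    tx.begin();
                    performOperation(row);  // Operation extension point
                    tx.commit();
                    logSuccess();
                } catch (Exception e) {
                    try { tx.rollback(); } catch (Exception ignored) { }
                    logFailure(e);
                }
            }
        };

        // BatchProcess: spawn the processors, then run the Query and feed the queue
        for (int i = 0; i < processorCount; i++) {
            new Thread(queueProcessor).start();
        }
        for (Row row : executeQuery()) {    // Query extension point
            queue.add(row);
        }
        queryDone.set(true);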