Search code examples
databasejetextensible-storage-engine

Inserting a million small records into Extensible Storage Engine (JetBlue) - quickly


I'm hoping Laurion Burchall reads this :-)

I need to insert a million tiny records as quickly as possible.

Right now I'm in a very tight loop where, for every record, I

a) start a transaction  (JetBeginTransaction)
b) prepare an update (JetPrepareUpdate)
c) add the row (JetSetColumns)
d) commit the transaction (JetCommitTransaction)

Right now, during this process, I'm in a tight loop on one processor. The target machine has multiple CPUs, great disks, and lots of free RAM.

I'm wondering how to get better performance.

As far as transactions go, I did some experiments and had troubles where errors came back if I put too much data in one transaction. I'd like to better understand what's going on there - do I have a bug, or is the size of a transaction capped, if capped can I enlarge the cap? I'm only investigating this because I'm guessing that a transaction give ESE the ability to do more caching in RAM, minimzing disk flushes? - this is just a guess?

In general how do I make use of multiple processors/lots of RAM/and nice disks? do I open the database twice and go from there? I'm not quite sure what happens with respect to thread safety and transactions. If I have two handles to the DB, each in a transaction, will a write on one handle be available for the second immediately, before the commit, or will I need to commit first?

any tips are appreciated

    here are the constraints    

a) I've got a million records that need to be written into the DB as fast as possible
b) to fully generate the record for insertion there are two searches that need to occur within the same table (seeking keys)
c) This is a rebuild/regeneration of the DB - it either worked, or it didnt.  
   If it didnt there is no going back, a fresh rebuild/regeneration is
   needed.  I cannot restart mid process and without all the data none of 
   the data is valuable.  READ: having one big transaction is fine if it 
   improves perf.  I'd like ESE to cache, in ram, if that helps perf.

thanks!


Solution

  • For single-threaded performance the most important thing to look at is your transaction model.

    If you have tried putting more data in one transaction and it failed you probably got a JET_errOutOfVersionStore. Esent has to track undo information for all operations performed in a transaction (to enable rollback) and that information is stored in the version store. The default size of the version store is quite small. You can increase it with the JET_paramMaxVerPages system parameter. A value of 1024 (64MB of version store) will enable quite large transactions. I suggest doing 100-1000 insertions per transaction.

    When you call JetCommitTransaction Esent will flush the log to disk, generating a synchronous I/O. To avoid that pass JET_bitCommitLazyFlush to JetCommitTransaction. Your transactions will still be atomic but not durable in the case of a crash (things will be fine if you exit normally). It looks like that should be find for your use.

    If you are inserting records in ascending order then you might be able to get away with a single-threaded application. If you can change your implementation to do sequential inserts you should -- they are a lot faster. For random inserts multiple threads can be useful. To use multiple threads you just need to create new sessions (JetBeginSession) and have them open the database (JetOpenDatabase). Esent uses snapshot isloation (http://en.wikipedia.org/wiki/Snapshot_isolation) so cannot see modifications made by other sessions which aren't committed or commit after your transaction begins. This is different than read-committed where you can see changes once another session commits. You might need to think about how to divide up the work to deal with this.