I have a VB.NET application in which users must process several million records and store them in a database (SQL CE). For each record, the processing is essentially: (1) load the existing data from the database, (2) calculate new values from it, and (3) save the results (a parent record plus several child records) back to the database.
If done sequentially, the first two steps take about 30% of the time, and the last step takes about 70%.
I thought it might be beneficial to move most of the database writes to a separate thread, more like a batch process, hoping (in an ideal world) to cut the total time by the 30% spent on steps 1 and 2. I tried this by storing the processed objects in a list; whenever the list count exceeds a certain number, I save the data via an action on a separate thread. Each save actually writes the main object and several related child objects, i.e.,
cmd.CommandText = "INSERT INTO [MainObjectTable] (Data1, Data2, Data3) VALUES (@Data1, @Data2, @Data3)"
cmd.ExecuteNonQuery()
cmd2.CommandText = "SELECT MAX(idx) FROM [MainObjectTable]"
MainObjectIdx = CInt(cmd2.ExecuteScalar())
'Loop over child objects and save each one to the database'
cmd3.CommandText = "INSERT INTO [ChildObject] (MainObjectIdx, Data4, Data5, Data6) VALUES (@MainObjectIdx, @Data4, @Data5, @Data6)"
For i = 0 To ChildList.Count - 1
    [Code to update parameters for the command object, including @MainObjectIdx]
    cmd3.ExecuteNonQuery()
Next
I lock the database to prevent multiple threads from saving at once. I believe I must do this because I use the main object's newly generated record key when saving that object's children, as shown above.
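As an aside, `SELECT MAX(idx)` only returns the right key while writes are fully serialized. A sketch of a less fragile alternative, assuming `idx` is an IDENTITY column (table and column names here are illustrative), is to read `SELECT @@IDENTITY` on the same connection immediately after the insert:

```vb
' Sketch: retrieve the identity value generated by the preceding INSERT.
' Assumes [MainObjectTable].idx is an IDENTITY column; names are illustrative.
cmd.CommandText = "INSERT INTO [MainObjectTable] (Data1, Data2, Data3) VALUES (@Data1, @Data2, @Data3)"
cmd.ExecuteNonQuery()

cmd2.CommandText = "SELECT @@IDENTITY"
MainObjectIdx = CInt(cmd2.ExecuteScalar())   'key of the row just inserted on this connection'
```

This ties the retrieved key to the insert on that connection rather than to whatever row currently has the highest index.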
The code essentially looks something like this:
Private Sub CalcData()
    Dim objectList As New List(Of ParentObject)
    For k = 0 To Calculations
        'Load data and instantiate the object'
        Dim o As New ParentObject(IDs(k)) '<-- This pulls records from the SQL CE database and loads the object'
        o.CalcData() 'calculate the new data based on the loaded data'
        objectList.Add(o) 'store in a list'
        If objectList.Count > 1000 Then
            BatchSave(objectList)
            'Start a fresh list rather than clearing the one the save task is still iterating'
            objectList = New List(Of ParentObject)
        End If
    Next
End Sub
Private Sub BatchSave(objList As List(Of ParentObject))
    mTaskList.Add(Tasks.Task.Factory.StartNew(
        Sub()
            DBLock.EnterWriteLock()
            Try
                For j = 0 To objList.Count - 1   'use the parameter, not the caller''s list'
                    [Code to update command object parameters, and save the object (and children) as discussed above where I show the SQL statements]
                Next
            Finally
                DBLock.ExitWriteLock()
            End Try
        End Sub))
End Sub
I thought this scheme would maximize performance by letting the saves run on a background thread. I structured the save as a batch (1000 records at a time) because I had read that parameterized SQL is more efficient when inserting many records. But the time reduction was not terribly impressive.
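One thing worth trying on top of the batching (my suggestion, not from the code above): wrap each batch in an explicit transaction so SQL CE commits once per 1000 rows instead of once per statement. A minimal sketch, assuming an open `SqlCeConnection` named `conn`; the table and column names are illustrative:

```vb
' Sketch: commit the whole batch in one transaction instead of per-row autocommit.
' Assumes an open SqlCeConnection named conn; table/column names are illustrative.
Dim tran As SqlCeTransaction = conn.BeginTransaction()
Try
    Using cmd As SqlCeCommand = conn.CreateCommand()
        cmd.Transaction = tran
        cmd.CommandText = "INSERT INTO [MainObjectTable] (Data1, Data2, Data3) VALUES (@Data1, @Data2, @Data3)"
        'Create the parameters once, then reuse them for every row in the batch'
        cmd.Parameters.Add("@Data1", SqlDbType.NVarChar)
        cmd.Parameters.Add("@Data2", SqlDbType.NVarChar)
        cmd.Parameters.Add("@Data3", SqlDbType.NVarChar)
        For Each o In objList
            cmd.Parameters("@Data1").Value = o.Data1
            cmd.Parameters("@Data2").Value = o.Data2
            cmd.Parameters("@Data3").Value = o.Data3
            cmd.ExecuteNonQuery()
        Next
    End Using
    tran.Commit()
Catch
    tran.Rollback()
    Throw
End Try
```

Reusing one prepared, parameterized command per batch also avoids re-parsing the statement on every insert.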
I also tried creating a new "Save" class to which I pass the data to be saved as it becomes available. The "Save" class creates a new Tasks.Task each time a parent object is passed to it, so I thought this would produce a more or less continuous stream of objects being saved on other threads, rather than relying on saving every 1000 objects. Inside the "Save" class I have the following:
Public Class SaveData
    Public Sub SaveBDLItem(ByVal o As ParentObject)
        Tasks.Task.Factory.StartNew(
            Sub()
                Dim objectIDx As Integer
                mParentLock.EnterWriteLock()
                Try
                    mcmd1.Parameters.Clear()
                    [code to add parameters to the command object]
                    mcmd1.ExecuteNonQuery()
                    'get the new key'
                    objectIDx = CInt(mcmd2.ExecuteScalar())
                Finally
                    mParentLock.ExitWriteLock()   'must release the same lock that was entered'
                End Try
                'Now update the children'
                mChildLock.EnterWriteLock()
                Try
                    For j = 0 To o.Children.Count - 1
                        mcmd3.Parameters.Clear()
                        [code to add parameters to the command object]
                        mcmd3.ExecuteNonQuery()
                    Next
                Finally
                    mChildLock.ExitWriteLock()
                End Try
            End Sub)
    End Sub
.
.
.
End Class
However, this implementation was much slower than my previous attempt, and in fact seemed to be operating synchronously. Any idea why this method is actually slower?
I would also love feedback if there are other ideas of how to speed the whole process up.
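One alternative worth considering (my suggestion, not from the question): since a SQL CE file effectively allows only one writer at a time, per-object tasks mostly pay task-creation and lock-contention overhead, which is consistent with the second approach running as if it were synchronous. A single dedicated writer thread fed from a queue keeps the calculation thread free without needing any write lock. A minimal sketch, where `ParentObject` and `SaveToDatabase` stand in for the application's own type and save routine:

```vb
' Sketch: one producer (the calculation loop) and one consumer (the only DB writer).
' ParentObject and SaveToDatabase are placeholders for the application's own code.
Imports System.Collections.Concurrent
Imports System.Threading.Tasks

Dim queue As New BlockingCollection(Of ParentObject)(boundedCapacity:=5000)

'Single writer thread: no lock needed, since nothing else touches the database.'
Dim writer = Task.Factory.StartNew(
    Sub()
        For Each o In queue.GetConsumingEnumerable()
            SaveToDatabase(o)   'parent + children, ideally inside one transaction'
        Next
    End Sub, TaskCreationOptions.LongRunning)

'Producer side, inside the calculation loop:'
'    queue.Add(o)'
'After the loop:'
'    queue.CompleteAdding()'
'    writer.Wait()'
```

The bounded capacity also applies back-pressure: if the writer falls behind, the calculation loop blocks on `Add` instead of buffering millions of objects in memory.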
Use the TableDirect APIs to avoid the overhead of the query processor. For inserts and selects, you will see tremendous speed gains. And remember that a SQL CE database is a single file, so it is limited in how much you can do in parallel. I have a number of recent blog posts with details on how to use the TableDirect APIs.
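For reference, a minimal sketch of a TableDirect insert using `SqlCeResultSet`, assuming an open `SqlCeConnection` named `conn`; the table name and column ordinals are illustrative:

```vb
' Sketch: TableDirect insert, bypassing the query processor.
' Assumes an open SqlCeConnection named conn; table name and ordinals are illustrative.
Using cmd As SqlCeCommand = conn.CreateCommand()
    cmd.CommandType = CommandType.TableDirect
    cmd.CommandText = "MainObjectTable"           'table name, not a SQL statement'
    Using rs As SqlCeResultSet = cmd.ExecuteResultSet(ResultSetOptions.Updatable)
        Dim rec As SqlCeUpdatableRecord = rs.CreateRecord()
        rec.SetString(1, "value1")                'ordinal 0 assumed to be the IDENTITY column'
        rec.SetString(2, "value2")
        rec.SetString(3, "value3")
        rs.Insert(rec)
    End Using
End Using
```

Because no SQL text is parsed, each insert goes straight to the storage engine, which is where the speed gains come from.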