
VoltDB: execute multiple inserts in one invocation with the C++ API


I currently have a model where a large number of inserts need to be done (not at startup) on the same table. For the time being I am preparing the insert values in the C++ code and then calling the insert stored procedure once per row.

e.g.

INSERT ... VALUES ('1','2')
INSERT ... VALUES ('3','4')
INSERT ... VALUES ('5','6')

I would like to know if it is possible (using VoltDB and the C++ client) to either:

1) Do bulk inserts e.g.

INSERT ... VALUES ('1','2'), ('3','4'), ('5','6')

or

2) Pass an array, or a string containing a custom delimiter, into the stored procedure, then parse it and issue the individual inserts inside the stored procedure itself.

INSERT ... VALUES ('1,2|3,4|5,6') or similar

then split the string inside the procedure.

If either is possible, could you please point me to an example, or to the C++ API syntax that would facilitate the implementation (e.g. looping in a stored procedure to parse the string, string-manipulation functions, etc.)?

I would like to try one of these options, in order to test the relative performance. Although I've read that individual inserts should be fast enough, I would think this can differ based on the use case.


Solution

  • Individual inserts would be faster if you called the default insert procedure for the table, e.g. "TABLENAME.insert", which takes the same values as INSERT ... VALUES but bypasses the AdHoc SQL parser and is routed more directly to the partition. That will give you the best per-row performance when making an individual procedure call for each record.

    On the Java client, there is an API that facilitates bulk loading of a table. There is an example tutorial for it here: https://github.com/VoltDB/voltdb/tree/master/examples/HOWTOs/bulkloader

    If the data exists in a CSV or delimited file, you could leverage the csvloader application, which uses the same bulkloader API.

    The C++ client does not have an implementation of the bulkloader API, so while it's not impossible, it would be a lot more difficult.

    Bulk inserts in the form of INSERT ... VALUES ('1','2'),('3','4'),... are not supported by VoltDB.

    The other approach you describe is possible. You could write a Java stored procedure that takes a VoltTable as an input parameter and, from the C++ client, build a Table object, which corresponds to the VoltTable in Java. Alternatively, you could pass in arrays of values.

    However, neither a VoltTable nor an array can be the partitioning-key parameter for the procedure. So if you are trying to do something high scale, you would want a separate parameter value for the partition key, and you would need to send a set of records that all belong in the same partition. That can be difficult to do.

    The easiest way is to write your own simple hashing function. As you generate or receive new records, hash them with your function and group them into buckets, then send each bucket of records to the database in bulk, with the hash value as the partition key. You would have to include a column in the table for this hash value; records that have the same hash value would therefore belong in the same partition.
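    The hash-and-bucket step described above can be sketched in plain C++. This is an illustrative sketch, not VoltDB API code: the `Record` type, the column names, and the bucket count are all assumptions; the resulting bucket id is what you would store in the extra hash column and pass as the procedure's partition-key parameter.

    ```cpp
    #include <cstdint>
    #include <functional>
    #include <iostream>
    #include <map>
    #include <string>
    #include <vector>

    // Hypothetical record type: two string columns, as in the
    // question's INSERT ... VALUES ('1','2') examples.
    struct Record {
        std::string col1;
        std::string col2;
    };

    // Simple, stable hash of the record's key column into a fixed
    // number of buckets. Records sharing a bucket id share a partition.
    int32_t bucketOf(const Record& r, int32_t numBuckets) {
        std::hash<std::string> h;
        return static_cast<int32_t>(h(r.col1) % numBuckets);
    }

    int main() {
        const int32_t numBuckets = 8;  // assumed; tune to your cluster
        std::vector<Record> incoming = {
            {"1", "2"}, {"3", "4"}, {"5", "6"}, {"1", "7"}
        };

        // Group incoming records into per-bucket batches.
        std::map<int32_t, std::vector<Record>> buckets;
        for (const Record& r : incoming) {
            buckets[bucketOf(r, numBuckets)].push_back(r);
        }

        // Each batch would then be sent to the database in one
        // procedure invocation (e.g. as a VoltTable or as arrays of
        // column values), with the bucket id as the partition key.
        for (const auto& [bucket, batch] : buckets) {
            std::cout << "bucket " << bucket << ": " << batch.size()
                      << " record(s)\n";
        }
        return 0;
    }
    ```

    Note that records with the same key column (here `"1"`) always hash to the same bucket, so they end up in the same batch and therefore the same partition.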

    Disclosure: I work at VoltDB.