Search code examples
pythonbatch-insertsinglestore

Implement batch insert to improve performance


I have written the following code to insert data to MEMSql which has almost similar syntax like in MySQL.

def get_connection(db=DATABASE):
    """ Returns a new connection to the database. """
    return database.connect(host=HOST, port=PORT, user=USER, password=PASSWORD, database=db)



def insert_data(data):
    print 'inserting data...'

    for item in data:
        vars_to_sql = []
        keys_to_sql = []
        print(item)
        for key,value in item.iteritems():
             if key == '__osHeaders':
                value = str(value)
             if isinstance(value, unicode):
                vars_to_sql.append(value.encode('ascii', 'ignore'))
                keys_to_sql.append(key.encode('ascii', 'ignore'))
             else:
                vars_to_sql.append(value)
                keys_to_sql.append(key)

        keys_to_sql = ', '.join(keys_to_sql)
        with get_connection() as conn:

             c = conn.execute("INSERT INTO tablename (%s) VALUES %r" % (keys_to_sql,  tuple(vars_to_sql),))
             print c

The field names can not be hard coded since they might change according to the data I get from the other end. Any way its a dictionary I'm iterating here. Since this single insertion is very slow I need to take batch size as a variable, form the query statement and insert it accordingly. So the query for batch size of 2 will be INSERT INTO tablename col1, col2 VALUES ('a', 'b'),('c','d')

Please help me how to introduce it here.


Solution

  • If you are using the MemSQL Python library you can use the multi_insert helper provided in the memsql.common.query_builder package. For example:

    from memsql.common.query_builder import multi_insert
    from memsql.common.database import connect
    
    sql, params = multi_insert("my_table", { "foo": 1 }, { "foo": 2 })
    # sql = 'INSERT INTO `my_table` (`foo`) VALUES (%(_QB_ROW_0)s), (%(_QB_ROW_1)s)'
    # params = {'_QB_ROW_0': [1], '_QB_ROW_1': [2]}
    
    with connect(...) as conn:
        conn.execute(sql, **params)
    

    Note that multi_insert requires that each record has the same set of columns defined since it translates it to a tuple based insert for the query.