Tags: python, sqlite, data-processing, large-data

Can I speed up a large data set operation in SQLite / Python?


I have a data set of roughly 1-5 billion 'box' objects stored in an SQLite database file, with each row in the format:

[x1,y1,z1,x2,y2,z2,box_id]

and currently a Python script processes it with something like:

import sqlite3 as lite

box_data = lite.connect('boxes.db')
cur = box_data.cursor()
editor_cursor = box_data.cursor()

cur.execute("SELECT * FROM boxes")
while True:
    row = cur.fetchone()
    if row is None:
        break

    row_id = row[6]

    x1_normalized = int(round(row[0]/smallest_box_size))
    y1_normalized = int(round(row[1]/smallest_box_size))
    z1_normalized = int(round(row[2]/smallest_box_size))

    x2_normalized = int(round(row[3]/smallest_box_size))
    y2_normalized = int(round(row[4]/smallest_box_size))
    z2_normalized = int(round(row[5]/smallest_box_size))

    editor_cursor.execute(
        "UPDATE boxes SET x1=?,y1=?,z1=?,x2=?,y2=?,z2=? WHERE id=?",
        (x1_normalized, y1_normalized, z1_normalized,
         x2_normalized, y2_normalized, z2_normalized, row_id))

where `smallest_box_size` is just some float. It's a simple normalization task: each box coordinate must be converted from its 'physical' size to normalized integer coordinates.

Currently the process takes on the order of several hours, and I would like to reduce that runtime. Is it possible to speed this up within my current Python/SQLite setup?

Any suggestions on how to implement this process in another faster database program may also be helpful :)


Solution

  • Have SQLite do all the work for you instead:

    editor_cursor.execute("""
        UPDATE boxes SET x1=CAST(ROUND(x1/:smallest_box_size) AS INTEGER),
                         y1=CAST(ROUND(y1/:smallest_box_size) AS INTEGER),
                         z1=CAST(ROUND(z1/:smallest_box_size) AS INTEGER),
                         x2=CAST(ROUND(x2/:smallest_box_size) AS INTEGER),
                         y2=CAST(ROUND(y2/:smallest_box_size) AS INTEGER),
                         z2=CAST(ROUND(z2/:smallest_box_size) AS INTEGER)""",
        {'smallest_box_size': smallest_box_size})
    box_data.commit()  # persist the single bulk UPDATE


    In other words, SQLite is perfectly capable of normalizing all the rows for you without piping them all through Python.

    Note that a plain CAST to INTEGER truncates toward zero rather than rounding, so ROUND() is applied first to reproduce the behaviour of Python's round() call.
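The difference between CAST and ROUND() is easy to verify directly. A minimal sketch using an in-memory database (the literal value 2.7 is just an illustration):

```python
# Compare SQLite's CAST truncation with ROUND()-then-CAST.
import sqlite3

con = sqlite3.connect(":memory:")
cur = con.cursor()
cur.execute("SELECT CAST(2.7 AS INTEGER), CAST(ROUND(2.7) AS INTEGER)")
truncated, rounded = cur.fetchone()
print(truncated, rounded)  # CAST alone truncates 2.7 to 2; ROUND() first gives 3
con.close()
```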

    For future reference: you can loop over a result set by iterating over the cursor directly. There is no need to call .fetchone() for each row (and note that a loop body consisting only of a comment is a syntax error in Python):

    cur.execute("SELECT * FROM boxes")
    for row in cur:
        ...  # process row; the loop ends automatically when the rows are exhausted
    

    This is implemented very efficiently: rows are fetched from the database only as the iteration requires them.
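If the per-row computation genuinely has to happen in Python, batching the updates with executemany inside a single transaction is usually far faster than issuing one autocommitted UPDATE per row. A minimal sketch mirroring the question's schema (the sample data and `smallest_box_size` value are made up; for billions of rows you would also chunk the SELECT, e.g. with fetchmany, rather than collecting everything in memory):

```python
import sqlite3

smallest_box_size = 0.5  # example value; the question only says "some float"

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE boxes (x1 REAL, y1 REAL, z1 REAL,"
            " x2 REAL, y2 REAL, z2 REAL, id INTEGER PRIMARY KEY)")
con.execute("INSERT INTO boxes VALUES (1.2, 2.6, 0.4, 3.1, 4.9, 1.6, 1)")

# Compute all normalized rows in Python first...
updates = []
for row in con.execute("SELECT * FROM boxes"):
    normalized = [int(round(v / smallest_box_size)) for v in row[:6]]
    updates.append((*normalized, row[6]))

# ...then write them back in one batch, committed as a single transaction.
with con:
    con.executemany(
        "UPDATE boxes SET x1=?, y1=?, z1=?, x2=?, y2=?, z2=? WHERE id=?",
        updates)
# x1 is now 2 (1.2/0.5 = 2.4, rounded), y1 is 5, and so on.
```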