I have a data set in the size range 1-5 billion 'box' objects stored in an SQLite database file in the format:
[x1,y1,z1,x2,y2,z2,box_id]
and currently I have an operation in a python script that does something like:
import sqlite3 as lite
box_data = lite.connect('boxes.db')
cur = box_data.cursor()
editor_cursor = box_data.cursor()
cur.execute("SELECT * FROM boxes")
while True:
row = cur.fetchone()
if row == None:
break
row_id = row[6]
x1_normalized = int(round(row[0]/smallest_box_size))
y1_normalized = int(round(row[1]/smallest_box_size))
z1_normalized = int(round(row[2]/smallest_box_size))
x2_normalized = int(round(row[3]/smallest_box_size))
y2_normalized = int(round(row[4]/smallest_box_size))
z2_normalized = int(round(row[5]/smallest_box_size))
editor_cursor.execute("UPDATE boxes SET x1=?,y1=?,z1=?,x2=?,y2=?,z2=? WHERE id=?",(x1_normalized,y1_normalized,z1_normalized,x2_normalized,y2_normalized,z2_normalized,row_id))
where 'smallest box size' is just some float. It's a simple normalization task whereby basically each box coordinate must be converted from its 'physical' size to normalized integer coordinates.
Currently the process takes on the order of several hours, and I would like to reduce this operating time. Would it be possible to speed up this process in my current python-SQLite process?
Any suggestions on how to implement this process in another faster database program may also be helpful :)
Have SQLite do all the work for you instead:
editor_cursor.execute("""
UPDATE boxes SET x1=CAST(x1/:smallest_box_size as INTEGER),
y1=CAST(y1/:smallest_box_size as INTEGER),
z1=CAST(z1/:smallest_box_size as INTEGER),
x2=CAST(x2/:smallest_box_size as INTEGER),
y2=CAST(y2/:smallest_box_size as INTEGER),
z2=CAST(z2/:smallest_box_size as INTEGER)""",
{'smallest_box_size': smallest_box_size})
In other words, SQLite is perfectly capable of normalizing all the rows for you without piping them all through Python.
The CAST
to INTEGER
will already round a REAL
value, no need to add an explicit round()
call here.
For future reference: you can loop over a result set by iterating over the cursor. No need to call .fetchone()
for each row:
cur.execute("SELECT * FROM boxes")
for row in cur:
# loop will terminate automatically when the rows are exhausted.
This is implemented very efficiently, it only loads results as required to iterate.