I am attempting to select from (with a WHERE
clause) and sort a large database table in sqlite3 via python. The sort is currently taking 30+ minutes on about 36 MB of data. I have a feeling it can work faster than this with indices, but I think the order of my code may be incorrect.
The code is executed in the order listed here.
My CREATE TABLE
statement looks like this:
c.execute('''CREATE table gtfs_stop_times (
trip_id text , --REFERENCES gtfs_trips(trip_id),
arrival_time text, -- CHECK (arrival_time LIKE '__:__:__'),
departure_time text, -- CHECK (departure_time LIKE '__:__:__'),
stop_id text , --REFERENCES gtfs_stops(stop_id),
stop_sequence int NOT NULL --NOT NULL
)''')
The rows are then inserted in the next step:
stop_times = csv.reader(open("tmp\\avl_stop_times.txt"))
c.executemany('INSERT INTO gtfs_stop_times VALUES (?,?,?,?,?)', stop_times)
Next, I create an index out of two columns (trip_id
and stop_sequence
):
c.execute('CREATE INDEX trip_seq ON gtfs_stop_times (trip_id, stop_sequence)')
Finally, I run a SELECT
statement with a WHERE
clause that sorts this data by the two columns used in the index and then write that to a csv file:
c.execute('''SELECT gtfs_stop_times.trip_id, gtfs_stop_times.arrival_time, gtfs_stop_times.departure_time, gtfs_stops.stop_id, gtfs_stop_times.stop_sequence
FROM gtfs_stop_times, gtfs_stops
WHERE gtfs_stop_times.stop_id=gtfs_stops.stop_code
ORDER BY gtfs_stop_times.trip_id, gtfs_stop_times.stop_sequence;
)''')
f = open("gtfs_update\\stop_times.txt", "w")
writer = csv.writer(f, dialect = 'excel')
writer.writerow([i[0] for i in c.description]) # write headers
writer.writerows(c)
del writer
Is there any way to speed up Step 4 (possibly be changing how I add and/or use the index) or should I just go to lunch while this runs?
I have added PRAGMA statements to try to improve performance to no avail:
c.execute('PRAGMA main.page_size = 4096')
c.execute('PRAGMA main.cache_size=10000')
c.execute('PRAGMA main.locking_mode=EXCLUSIVE')
c.execute('PRAGMA main.synchronous=NORMAL')
c.execute('PRAGMA main.journal_mode=WAL')
c.execute('PRAGMA main.cache_size=5000')
The SELECT
executes extremely fast because there is no gtfs_stops
table and you get nothing but an error message.
If we assume that there is a gtfs_stops
table, then your trip_seq
index is already quite optimal for the query.
However, you also need an index for looking up stop_code
values in the gtfs_stops
column.