I'm executing a query on a table that returns a single column with a few hundred million rows, which I want to plot in a histogram. The problem is that this eats up nearly all my memory (7.8 GB) and all my swap (8 GB), and the program exits with exit code -9 before cur.fetchall() finishes.
How can I prevent this from happening? Should I sort my column first and then run several queries on chunks of it, or is there a better way to fetch the data from my query? The cur.execute itself takes almost no time.
#!/usr/bin/python
import sqlite3 as lite
import numpy as np
import sys
import os
import time
import matplotlib.pyplot as plt

def getQuantity(databasepath):
    con = lite.connect(databasepath)
    binwidth = 1
    start = time.time()
    with con:
        cur = con.cursor()
        cur.execute('SELECT latitude FROM MessageType1')
        con.commit()
        latitudes = cur.fetchall()  # Breakdown here
        latitudes = [x[0] for x in latitudes]
        plt.hist(latitudes, bins=range(int(min(latitudes)), int(max(latitudes)) + binwidth, binwidth))
        plt.title("Bucket size: " + str(binwidth))
        plt.ylabel("Number of messages")
        plt.savefig('latlongstats' + 't' + str(time.strftime("%H:%M:%S")), format='png')

if __name__ == "__main__":
    getQuantity('database/database.db')
I found that if I replaced the following:
latitudes = cur.fetchall()
print "fetched"
latitudes = [x[0] for x in latitudes]
with:
latitudes = []
while True:
    tmp = cur.fetchone()
    if tmp is not None:
        latitudes.append(tmp[0])
    else:
        break
I got the same result, although it takes forever, and it still nearly eats up my RAM (though not my swap).