Existing Database and Desired Outcome:
I have a large SQLite database (12 GB, tables with 44 million+ rows) that I would like to modify using Pandas in Python 3.
Example Objective: I hope to read one of these large tables (44 million rows) into a DataFrame in chunks, manipulate each chunk, and write the result to a new table. If possible, I would like to replace the new table if it already exists, and then append each chunk to it.
Because my manipulations only add or modify columns, the new table should have the same number of rows as the original table.
Issues:
The primary issue seems to stem from the following line in the below code:
df.to_sql(new_table, con=db, if_exists = "append", index=False)
Traceback (most recent call last):
File "example.py", line 23, in <module>
for df in df_generator:
File "/usr/local/lib/python3.5/site-packages/pandas/io/sql.py", line 1420, in _query_iterator
data = cursor.fetchmany(chunksize)
sqlite3.OperationalError: SQL logic error or missing database
If I then rerun the script with the same new table name, it runs for each chunk, plus one extra chunk of a single row.
When the df.to_sql() line is commented out, the loop runs for the expected number of chunks.
Test Example of Issue with Complete Code:
Complete Code: example.py
import pandas as pd
import sqlite3

# Helper functions used in example
def ren(invar, outvar, df):
    # Rename a column in the DataFrame
    df.rename(columns={invar: outvar}, inplace=True)
    return df

def count_result(c, table):
    # Print the total row count of the given table
    for r in c.execute("SELECT COUNT(*) FROM {};".format(table)):
        print("[*] total: {:,} rows in {} table".format(r[0], table))

# Connect to data
db = sqlite3.connect("test.db")
c = db.cursor()
new_table = "new_table"

# Load data in chunks
df_generator = pd.read_sql_query("select * from test_table limit 10000;",
                                 con=db, chunksize=5000)

for df in df_generator:
    # Functions to modify data, example
    df = ren("name", "renamed_name", df)
    print(df.shape)
    df.to_sql(new_table, con=db, if_exists="append", index=False)

# Count rows if the new table was created
try:
    count_result(c, new_table)
except sqlite3.OperationalError:
    pass
1. Result when
#df.to_sql(new_table, con=db, if_exists = "append", index=False)
(the problem line is commented out):
$ python3 example.py
(5000, 22)
(5000, 22)
This is what I expect, since the example code limits my large table to 10,000 rows (two chunks of 5,000).
2. Result when
df.to_sql(new_table, con=db, if_exists = "append", index=False)
a. the problem line is not commented out
b. this is the first time the code is run with a new_table:
$ python3 example.py
(5000, 22)
Traceback (most recent call last):
File "example.py", line 23, in <module>
for df in df_generator:
File "/usr/local/lib/python3.5/site-packages/pandas/io/sql.py", line 1420, in _query_iterator
data = cursor.fetchmany(chunksize)
sqlite3.OperationalError: SQL logic error or missing database
3. Result when
df.to_sql(new_table, con=db, if_exists = "append", index=False)
a. the problem line is not commented out
b. the above code is run a second time with the same new_table:
$ python3 example.py
(5000, 22)
(5000, 22)
(5000, 22)
(1, 22)
[*] total: 20,001 rows in new_table table
Thus, I have two issues: first, the code breaks when run the first time (Result 2); second, the total number of rows after the second run (Result 3) is more than double what I expected.
Any suggestions on how I can solve this issue would be greatly appreciated.
You may try to specify:
db = sqlite3.connect("test.db", isolation_level=None)
# ----> ^^^^^^^^^^^^^^^^^^^^
Besides that, you may try to increase your chunksize, because otherwise the time between commits is too short for the SQLite DB; that is what is causing this error, I guess. I would also recommend using PostgreSQL, MySQL/MariaDB, or something similar; they are much more reliable and appropriate for a database of this size.
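For concreteness, here is a minimal sketch of how these suggestions could be applied to the example above. It reuses the test.db, test_table, and new_table names from the question. The second, write-only connection is an extra assumption on my part (it keeps the inserts off the connection the chunked reader is still iterating over), and the DROP TABLE step is just one way to get the replace-then-append behaviour the question asks for:

import pandas as pd
import sqlite3

# Autocommit mode, as suggested above (isolation_level=None)
read_db = sqlite3.connect("test.db", isolation_level=None)
# Assumption beyond the answer: a separate connection used only for writes,
# so the inserts never go through the connection the reader is iterating
write_db = sqlite3.connect("test.db", isolation_level=None)

new_table = "new_table"
# Start fresh, then append each chunk (replace-then-append behaviour)
write_db.execute("DROP TABLE IF EXISTS {};".format(new_table))

# A larger chunksize would also reduce how often commits happen
df_generator = pd.read_sql_query("select * from test_table limit 10000;",
                                 con=read_db, chunksize=5000)

for df in df_generator:
    df = df.rename(columns={"name": "renamed_name"})
    df.to_sql(new_table, con=write_db, if_exists="append", index=False)

write_db.close()
read_db.close()

With the writes going through a separate autocommit connection, the read cursor should never be invalidated mid-iteration, and each chunk should append exactly once.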