My problem is essentially this: when I try to use to_sql with if_exists = 'append' and name set to a table that already exists on my SQL Server, Python crashes.
This is my code:
@event.listens_for(engine, 'before_cursor_execute')
def receive_before_cursor_execute(conn, cursor, statement, params, context, executemany):
    if executemany:
        cursor.fast_executemany = True

df.to_sql(name = 'existingSQLTable', con = engine, if_exists = 'append', index = False, chunksize = 10000, dtype = dataTypes)
I didn't include it here, but dataTypes is a dictionary mapping every column name to its data type.
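For context, it maps each column to a SQLAlchemy type, along these lines (the column names and types here are just placeholders, not my actual schema):

from sqlalchemy.types import Float, Date, String

# Illustrative only -- the real dictionary covers every column in the DataFrame.
dataTypes = {
    'price': Float(),
    'tradeDate': Date(),
    'stateCode': String(2)
}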
This is the error I get:
Traceback (most recent call last):
File "C:\Apps\Anaconda3\lib\site-packages\sqlalchemy\engine\base.py", line 1116, in _execute_context
context)
File "C:\Apps\Anaconda3\lib\site-packages\sqlalchemy\engine\default.py", line 447, in do_executemany
cursor.executemany(statement, parameters)
pyodbc.IntegrityError: ('23000', "[23000] [Microsoft][SQL Server Native Client 11.0][SQL Server]Violation of PRIMARY KEY constraint 'PK__existingSQLTable__'. Cannot insert duplicate key in object 'dbo.existingSQLTable'. The duplicate key value is (20008.7, 2008-08-07, Fl). (2627) (SQLExecute); [23000] [Microsoft][SQL Server Native Client 11.0][SQL Server]The statement has been terminated. (3621)")
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "<pyshell#24>", line 1, in <module>
Table.to_sql(name = 'existingSQLTable', con = engine, if_exists = 'append', index = False, chunksize = 10000, dtype = dataTypes)
File "C:\Apps\Anaconda3\lib\site-packages\pandas\core\generic.py", line 1165, in to_sql
chunksize=chunksize, dtype=dtype)
File "C:\Apps\Anaconda3\lib\site-packages\pandas\io\sql.py", line 571, in to_sql
chunksize=chunksize, dtype=dtype)
File "C:\Apps\Anaconda3\lib\site-packages\pandas\io\sql.py", line 1250, in to_sql
table.insert(chunksize)
File "C:\Apps\Anaconda3\lib\site-packages\pandas\io\sql.py", line 770, in insert
self._execute_insert(conn, keys, chunk_iter)
File "C:\Apps\Anaconda3\lib\site-packages\pandas\io\sql.py", line 745, in _execute_insert
conn.execute(self.insert_statement(), data)
File "C:\Apps\Anaconda3\lib\site-packages\sqlalchemy\engine\base.py", line 914, in execute
return meth(self, multiparams, params)
File "C:\Apps\Anaconda3\lib\site-packages\sqlalchemy\sql\elements.py", line 323, in _execute_on_connection
return connection._execute_clauseelement(self, multiparams, params)
File "C:\Apps\Anaconda3\lib\site-packages\sqlalchemy\engine\base.py", line 1010, in _execute_clauseelement
compiled_sql, distilled_params
File "C:\Apps\Anaconda3\lib\site-packages\sqlalchemy\engine\base.py", line 1146, in _execute_context
context)
File "C:\Apps\Anaconda3\lib\site-packages\sqlalchemy\engine\base.py", line 1341, in _handle_dbapi_exception
exc_info
File "C:\Apps\Anaconda3\lib\site-packages\sqlalchemy\util\compat.py", line 202, in raise_from_cause
reraise(type(exception), exception, tb=exc_tb, cause=cause)
File "C:\Apps\Anaconda3\lib\site-packages\sqlalchemy\util\compat.py", line 185, in reraise
raise value.with_traceback(tb)
File "C:\Apps\Anaconda3\lib\site-packages\sqlalchemy\engine\base.py", line 1116, in _execute_context
context)
File "C:\Apps\Anaconda3\lib\site-packages\sqlalchemy\engine\default.py", line 447, in do_executemany
cursor.executemany(statement, parameters)
Based on the error, it appears to me that something is wrong with the fast_executemany flag, but I've read a lot of documentation on it and don't see anything wrong with how I'm using it.
Things that may be of note:
The obvious workaround to me was to break the DataFrame up into chunks of 900,000 rows. While the first chunk uploads successfully, I cannot append even a single additional row to the table without Python crashing.
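Roughly, the chunked append looked like this (same df, engine, and dataTypes as above; the slicing helper is a sketch, not my verbatim code):

# Append the DataFrame in slices of at most 900,000 rows.
chunk_rows = 900000
for start in range(0, len(df), chunk_rows):
    df.iloc[start:start + chunk_rows].to_sql(
        name = 'existingSQLTable', con = engine, if_exists = 'append',
        index = False, chunksize = 10000, dtype = dataTypes)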
Is this error a result of the code meant to speed up the process (which it does fantastically)? Am I misunderstanding the to_sql function? Or is there something else going on? Any suggestions would be great! Also, if anyone has run into a similar problem, it would be great to know!
As @Jon Clements explained, the problem was that there were rows with identical primary keys (even though the rows themselves weren't identical). I used the pandas df.drop_duplicates function with the subset parameter set to the primary key columns, which solved the Violation of PRIMARY KEY constraint error.
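For anyone who hits the same thing, the de-duplication step was roughly this (the primary-key column names below are placeholders for my actual key columns):

# Keep one row per primary-key combination before appending.
df = df.drop_duplicates(subset = ['keyCol1', 'keyCol2', 'keyCol3'])
df.to_sql(name = 'existingSQLTable', con = engine, if_exists = 'append',
          index = False, chunksize = 10000, dtype = dataTypes)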