Search code examples
pythonmysqlms-accesssqlalchemypyodbc

How to do this Not In operation without triggering an overflow in the marker amount of operations in pyobdc/sqlalchemy?


This is a simplification of the case: I have two databases, a MySQL and a MS_Access. I am trying to delete all elements from the MsAccess that are not in the MySQL table but are still in MSAccess. I am using sqlalchemy to connect to both DB. To connect with MSAccess (I know, this database should not be used anymore, this is actually part of a migration process), I am using sqlalchemy-access, that internally works with pyodbc.

The code that does this operation is:

#every row in the mysql table contains a field that references its correspondent row in msaccess
mysql_ids = mysql_session.query(mysql_table.id_msaccess).all()
list_of_ids = [elem(0) for elem in mysql_ids]

delete_query = delete(access_table).where((access_table).id.not_in(list_of_ids))
results = access_session.execute(delete_query)

However, I get this error message:

(pyodbc.ProgrammingError) ('The SQL contains -9972 parameter markers, but 55564 parameters were supplied) DELETE FROM [access_table] WHERE ([access_table].[id] NOT IN (?, ?, ... <here there are all the 55564 parameter markers>) parameters: (241, 242, 243,...)

I have found this issue in pyodbc's github page:

Github Issue in Pyodbc

They essentially say that there is a marker counter that overflows in the internal implementation. They are talking about SQL Server but I guess the same thing happens here.

I could do this query in blocks of 32768 rows, or otherwise check for every element from the mysql table to see if it is in the ms-access table (I think this would be quite slow) but I wonder if there is not a better approach. Do you have any suggestions on how could I approach this?

Thanks in advance for any suggestions


Solution

  • I could do this query in blocks of 32768 rows

    That won't work for a NOT IN query. Say you had a list of rows to keep:

    [1, 2, 3, 4, 5, 6]
    

    If you tried to do that in batches of 3 then the first DELETE would be

    DELETE FROM access_table WHERE id NOT IN (1, 2, 3)
    

    which would delete the rows with id values of 4, 5, and 6. Then the next DELETE would be

    DELETE FROM access_table WHERE id NOT IN (4, 5, 6)
    

    which would delete the rows with id values of 1, 2, and 3.

    However, you could build a list of rows to delete like this:

    with mysql_engine.begin() as conn:
        mysql_existing = (
            conn.scalars(sa.select(mysql_table.c.id_msaccess)).all()
        )
        print(mysql_existing)  # [2, 3]
    
    with access_engine.begin() as conn:
        access_existing = (
            conn.scalars(sa.select(access_table.c.id)).all()
        )
        print(access_existing)  # [1, 2, 3, 4, 5, 6]
    
    access_to_delete = list(set(access_existing).difference(mysql_existing))
    print(access_to_delete)  # [1, 4, 5, 6]
    

    and you could process that list in batches by using IN instead of NOT IN.