Search code examples
pythonsqlalchemy

Python3 SQLAlchemy delete duplicates


I'm using SQLAlchemy to manage a database and I'm trying to delete all rows that contain duplicates. The table has an id (primary key) and domain name.

Example:
ID| Domain
1 | example-1.com
2 | example-2.com
3 | example-1.com

In this case I want to delete 1 instance of example-1.com. Sometimes I will need to delete more than 1 but in general the database should not have a domain more than once and if it does, only the first row should be kept and the others should be deleted.


Solution

  • Assuming your model looks something like this:

    import sqlalchemy as sa
    from sqlalchemy import orm
    
    Base = orm.declarative_base()
    
    
    class Domain(Base):
        __tablename__ = 'domain_names'
    
        id = sa.Column(sa.Integer, primary_key=True)
        domain = sa.Column(sa.String)
    

    Then you can delete the duplicates like this:

    # Create a query that identifies the row for each domain with the lowest id
    inner_q = session.query(sa.func.min(Domain.id)).group_by(Domain.domain)
    aliased = sa.alias(inner_q)
    # Select the rows that do not match the subquery
    q = session.query(Domain).filter(~Domain.id.in_(aliased))
    
    # Delete the unmatched rows (SQLAlchemy generates a single DELETE statement from this loop)
    for domain in q:
        session.delete(domain)
    session.commit()
    
    # Show remaining rows
    for domain in session.query(Domain):
        print(domain)
    print()
    

    If you are not using the ORM, the core equivalent is:

    meta = sa.MetaData()
    domains = sa.Table('domain_names', meta, autoload=True, autoload_with=engine)
    
    inner_q = sa.select([sa.func.min(domains.c.id)]).group_by(domains.c.domain)
    aliased = sa.alias(inner_q)
    
    with engine.connect() as conn:
        conn.execute(domains.delete().where(~domains.c.id.in_(aliased)))
    

    This answer is based on the SQL provided in this answer. There are other ways of deleting duplicates, which you can see in the other answers on the link, or by googling "sql delete duplicates" or similar.