Search code examples
djangopostgresqloptimizationmany-to-manymanytomanyfield

Django database access optimization: Efficient creation of many-to-many relationships (between existing objects)


I'm using Django 2.2 with a PostgreSQL database.

I have two models: Gene and Annotation and need to create and link (many-to-many) thousands of Genes and Annotations at the same time.

class Gene(models.Model):
    identifier = models.CharField(max_length=50, primary_key=True)
    annotation = models.ManyToManyField(Annotation)
class Annotation(models.Model):
    name = models.CharField(max_length=120, unique=True, primary_key=True)

I already found a way to create the objects very efficiently:

Gene.objects.bulk_create([Gene(identifier=identifier) for identifier in gene_id_set])

This is my Django-docs-inspired way to create relationships:

relationships = {
    'gene1': ['anno1', 'anno2'],
    'gene2': ['anno3'],
    ...
}

for gene in relationships:
    gene = Annotation.objects.get(pk='gene1')
    gene.annotation_set.set([Annotation.objects.get(pk=anno) for anno in relationships[gene])

But this is very clumsy: It hits the database 4 times! Is there not a better way, using Django-built-in-tools or raw SQL queries?

The many-to-many table (myapp_gene_annotation) looks like this:

id gene_id   annotation_id
1  gene1       anno1
2  gene1       anno2
3  gene2       anno3
...

Solution

  • Now we can create Gene_annotation objects: the implicit model Django has constructed for the ManyToMany table, like:

    through_model = Gene.annotation.through
    
    objs = [
        through_model(gene_id=gene_id, annotation_id=anno_id)
        for gene_id, rels in relationships.items()
        for anno_id in rels
    ]

    Now we can perform a bulk insert in the table of the through_model:

    through_model.objects.bulk_create(objs)

    You should of course only add the relations after you have added the Genes and Annotations, since otherwise, the foreign key constraints at the database side will raise an error.

    We will here insert all the relations in one time. If the table is huge, this might result in multiple queries, but still it is more efficient than querying once per relation.