python django many-to-many

Proper way to bulk_create for ManyToMany field, Django?


I have this code for populating a table.

def add_tags(count):
    print "Add tags"
    insert_list = []
    photo_pk_lower_bound = Photo.objects.all().order_by("id")[0].pk
    photo_pk_upper_bound = Photo.objects.all().order_by("-id")[0].pk
    for i in range(count):
        t = Tag( tag = 'tag' + str(i) )
        insert_list.append(t)
    Tag.objects.bulk_create(insert_list)
    for i in range(count):
        random_photo_pk = randint(photo_pk_lower_bound, photo_pk_upper_bound)
        p = Photo.objects.get( pk = random_photo_pk )
        t = Tag.objects.get( tag = 'tag' + str(i) )
        t.photos.add(p)

And this is the model:

class Tag(models.Model):
    tag = models.CharField(max_length=20,unique=True)
    photos = models.ManyToManyField(Photo)

As I understand from this answer (Django: invalid keyword argument for this function), I have to save the Tag objects first (because of the ManyToMany field) and then attach photos to them with add(). But for a large count this process takes too long. Is there a way to refactor this code to make it faster?

In general, I want to populate the Tag model with random dummy data.

EDIT 1 (model for photo)

class Photo(models.Model):
    photo = models.ImageField(upload_to="images")
    created_date = models.DateTimeField(auto_now=True)
    user = models.ForeignKey(User)

    def __unicode__(self):
       return self.photo.name

Solution

  • TL;DR Use the Django auto-generated "through" model to bulk insert m2m relationships.

    "Tag.photos.through" => Django generated Model with 3 fields [ id, photo, tag ]
    photo_tag_1 = Tag.photos.through(photo_id=1, tag_id=1)
    photo_tag_2 = Tag.photos.through(photo_id=1, tag_id=2)
    Tag.photos.through.objects.bulk_create([photo_tag_1, photo_tag_2, ...])
    

    This is the fastest way that I know of. I use it all the time to create test data, and I can generate millions of records in minutes.

    Edit from Georgy:

    from random import randint, shuffle

    def add_tags(count):
        Tag.objects.bulk_create([Tag(tag='tag%s' % t) for t in range(count)])
    
        tag_ids = list(Tag.objects.values_list('id', flat=True))
        photo_ids = Photo.objects.values_list('id', flat=True)
        tag_count = len(tag_ids)
           
        for photo_id in photo_ids:
            tag_to_photo_links = []
            shuffle(tag_ids)
    
            rand_num_tags = randint(0, tag_count)
            photo_tags = tag_ids[:rand_num_tags]
    
            for tag_id in photo_tags:
                # through is the model generated by django to link m2m between tag and photo
                photo_tag = Tag.photos.through(tag_id=tag_id, photo_id=photo_id)
                tag_to_photo_links.append(photo_tag)
    
            Tag.photos.through.objects.bulk_create(tag_to_photo_links, batch_size=7000)
    

    I didn't create the models to test this, but the structure is there; you might have to tweak a few things to make it work. Let me know if you run into any problems.
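The pair-building step in the answer can be tried without a database. The following is a minimal sketch in plain Python, not the answerer's code: `build_link_pairs`, `chunked`, and `max_tags_per_photo` are hypothetical names introduced here for illustration. It uses `random.sample` instead of an in-place `shuffle`, and it collects the links for all photos before splitting them into batches, which is a variation on the per-photo `bulk_create` shown above.

```python
import random


def build_link_pairs(tag_ids, photo_ids, max_tags_per_photo, rng=random):
    """Return unique (tag_id, photo_id) pairs: a random subset of tags per photo.

    Hypothetical helper; the names and the per-photo cap are assumptions.
    """
    tag_ids = list(tag_ids)
    limit = min(max_tags_per_photo, len(tag_ids))
    pairs = []
    for photo_id in photo_ids:
        # sample() draws without replacement, so a photo never receives the
        # same tag twice and the tag list itself is left untouched.
        how_many = rng.randint(0, limit)
        for tag_id in rng.sample(tag_ids, how_many):
            pairs.append((tag_id, photo_id))
    return pairs


def chunked(items, size):
    """Yield successive slices of at most `size` items."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


# Toy ids standing in for values_list('id', flat=True) results.
pairs = build_link_pairs(range(1, 6), range(100, 104),
                         max_tags_per_photo=3, rng=random.Random(0))
batches = list(chunked(pairs, 4))
```

In a Django project, each pair would presumably become `Tag.photos.through(tag_id=tag_id, photo_id=photo_id)`, and the whole list could go to a single `bulk_create(..., batch_size=...)` call, which does the chunking itself. Capping the number of tags per photo keeps the link table from growing to roughly half of tags × photos, which is what `randint(0, tag_count)` produces on average.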
