Search code examples
djangodatabasedjango-modelsormdjango-migrations

Efficiently updating a large number of records based on a field in that record using Django


I have about a million Comment records that I want to update based on that comment's body field. I'm trying to figure out how to do this efficiently. Right now, my approach looks like this:

update_list = []
qs = Comments.objects.filter(word_count=0)
for comment in qs:
    model_obj = Comments.objects.get(id=comment.id)
    model_obj.word_count = len(model_obj.body.split())
    update_list.append(model_obj)
Comment.objects.bulk_update(update_list, ['word_count'])

However, this hangs and seems to time out in my migration process. Does anybody have suggestions on how I can accomplish this?


Solution

  • It's not easy to determine the memory footprint of a Django object, but an absolute minimum is the amount of space needed to store all of its data. My guess is that you may be running out of memory and page-thrashing.

    You probably want to work in batches of, say, 1000 objects at a time. Use Queryset slicing, which returns another queryset. Try something like

    BATCH_SIZE = 1000 
    start = 0
    base_qs = Comments.objects.filter(word_count=0)
    
    while True:
        batch_qs = base_qs[ start: start+BATCH_SIZE ]
        start += BATCH_SIZE
        if not batch_qs.exists():
            break
    
        update_list = []
        for comment in batch_qs:
            model_obj = Comments.objects.get(id=comment.id)
            model_obj.word_count = len(model_obj.body.split())
            update_list.append(model_obj)
        Comment.objects.bulk_update(update_list, ['word_count'])
        print( f'Processed batch starting at {start}' )
    

    Each trip around the loop will free the space occupied by the previous trip when it replaces batch_qs and update_list. The print statement will allow you to watch it progress at a hopefully acceptable, regular rate!

    Warning - I have never tried this. I'm also wondering whether slicing and filtering will play nice with each other or whether one should use

    base_qs = Comments.objects.all()
    ...
    while True:
        batch_qs = base_qs[ start: start+BATCH_SIZE ]
        ....
        for comment in batch_qs.filter(word_count=0) : 
    

    so you are slicing your way though rows in the entire DB table and retrieving a subset of each slice that needs updating. This feels "safer". Anybody know for sure?