I have two tables with around 1M rows each. I've added a migration that adds two new columns to one of them, and computing their values requires some work.
What I did was write a management command that fetches all rows in that table, pages over them with a Paginator using a chunk size of 1000, and, after computing the new values for each page, performs a bulk_update.
The problem is that only the odd pages are updated; the even ones keep their null values. I start with 1000 pages to update, then I have to run the command again to update the remaining 500 pages, and so on. I've also tested with a chunk size of 10, and the problem persists. Is there something wrong with this code?
from django.core.management.base import BaseCommand
from django.core.paginator import Paginator


def add_values_to_columns(foo):
    # compute the values for the two new columns
    foo.col1 = ...
    foo.col2 = ...


class Command(BaseCommand):
    def handle(self, *args, **options):
        try:
            queryset = Foo.objects.all() \
                .filter(col1__isnull=True, col2__isnull=True) \
                .order_by("id") \
                .select_related("bar")
            chunk_size = 1000
            paginator = Paginator(queryset, chunk_size)
            pages = paginator.page_range
            number_pages = pages[-1]
            for page_number in pages:
                page = paginator.page(page_number)
                updated_rows = []
                for foo in page.object_list:
                    add_values_to_columns(foo)
                    updated_rows.append(foo)
                Foo.objects.bulk_update(updated_rows, ["col1", "col2"])
        except Exception as e:
            self.stdout.write(str(e))
The paginator is lazy; in fact, that is the main use case of a paginator: to retrieve a subset of the records when needed.
As a result, if you have a page size of 10, you fetch the first ten items and update them so that they no longer satisfy the filter condition of your queryset. When you then fetch the next page, the paginator skips ten elements, but these are not the same ten elements as before: since you updated rows, everything that still matches the filter has shifted down, and the offset therefore jumps over ten rows that were never processed.
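To see the effect without a database, here is a plain-Python sketch (my own illustration, not Django code): each "update" removes a row from the set that matches the filter, while the page offset keeps advancing, so every other batch is skipped.

rows = list(range(20))  # 20 rows that still match col1__isnull=True
page_size = 5
page_number = 0
while page_number * page_size < len(rows):
    page = rows[page_number * page_size:(page_number + 1) * page_size]
    for row in page:
        rows.remove(row)  # "updating" a row makes it drop out of the filter
    page_number += 1      # but the offset advances as if nothing changed
print(rows)  # [5, 6, 7, 8, 9, 15, 16, 17, 18, 19]: half the rows were skipped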
We can work with an iterator instead:
from django.core.management.base import BaseCommand


class Command(BaseCommand):
    def handle(self, *args, **options):
        chunk_size = 1000
        queryset = (
            Foo.objects.all()
            .filter(col1__isnull=True, col2__isnull=True)
            .order_by('id')
            .select_related('bar')
            .iterator(chunk_size=chunk_size)
        )
        updated_rows = []
        for foo in queryset:
            add_values_to_columns(foo)
            updated_rows.append(foo)
            if len(updated_rows) >= chunk_size:
                Foo.objects.bulk_update(updated_rows, ['col1', 'col2'])
                updated_rows = []
        # flush the final partial batch; the loop above only writes full batches
        if updated_rows:
            Foo.objects.bulk_update(updated_rows, ['col1', 'col2'])
This will thus work with a cursor to fetch records in chunks, collecting objects and writing them back one batch at a time. Note the final bulk_update after the loop: the loop only flushes full batches, so without it the last partial batch would never be saved.
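As an alternative sketch (my own suggestion, not part of the fix above; it reuses chunk_size and add_values_to_columns from the command): since updated rows drop out of the filter, you can also repeatedly take the first chunk of rows that still match, so no offset is involved at all.

while True:
    # re-evaluate the filter and take the first chunk; rows that were
    # already updated no longer match, so nothing is ever skipped
    batch = list(
        Foo.objects.filter(col1__isnull=True, col2__isnull=True)
        .order_by('id')
        .select_related('bar')[:chunk_size]
    )
    if not batch:
        break
    for foo in batch:
        add_values_to_columns(foo)
    Foo.objects.bulk_update(batch, ['col1', 'col2'])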