python django python-3.x django-forms django-file-upload

Django clean: Determine the new line character in TemporaryFileUploadHandler chunk

During the upload of a file I need to split its contents into lines, count the characters of each line, and raise an error if any line exceeds a certain length.

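from django import forms
from django.db import models


class TheModel(models.Model):
    upload_file = models.FileField(
        upload_to='the/path'
    )


class TheForm(forms.ModelForm):
    class Meta:
        model = TheModel
        fields = ['upload_file']

    def clean_upload_file(self):
        the_file = self.cleaned_data.get('upload_file')
        if the_file:
            # Read the upload piece by piece instead of loading it all at once.
            for chunk in the_file.chunks():  # the file is huge
                import ipdb; ipdb.set_trace()
        return the_file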

the_file is opened in rb+ mode. Inside the debugger, the current chunk looks like this:

>>>print(chunk)
b' counterproductive\nbishop\nsa raindrop\nsangu'
>>>print(the_file.mode)
'rb+'

The tail of the chunk (b'sangu') is apparently the start of a line that continues in the next iteration.

>>>print(chunk.splitlines())
[b' counterproductive', b'bishop', b'sa raindrop', b'sangu']

splitlines() on its own does not tell whether the last entry is a complete line. Moreover, \n is not guaranteed to be the line separator of every uploaded file read in binary mode.

If the new line character varies (may be \n or \r\n for example), how can I distinguish whether the last entry of the list represents the end of a line or just the first part of a new one?
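One way to inspect this, sketched here as an illustration rather than a definitive answer: bytes.splitlines() accepts keepends=True, which leaves the terminator (\n, \r or \r\n) attached to each entry, so the last entry can be checked for one. The helper name last_line_is_complete is hypothetical and not part of Django.

```python
def last_line_is_complete(chunk: bytes) -> bool:
    """Return True if the chunk's final line ends with a line terminator."""
    # keepends=True keeps b"\n", b"\r" or b"\r\n" attached to each entry.
    lines = chunk.splitlines(keepends=True)
    if not lines:
        # An empty chunk has no partial line to carry over.
        return True
    return lines[-1].endswith((b"\n", b"\r"))


chunk = b" counterproductive\nbishop\nsa raindrop\nsangu"
print(last_line_is_complete(chunk))          # False: b'sangu' is cut off
print(last_line_is_complete(b"bishop\r\n"))  # True
```

This check has a blind spot: a chunk that ends in b"\r" may be the first half of a \r\n pair whose \n arrives at the start of the next chunk, where it would show up as an extra empty line.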

Solution

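UploadedFile.multiple_chunks() reports whether the upload is large enough to be read in several chunks (by default, according to the Django documentation, anything larger than 2.5 MB). When it returns True, the first and last entries of the list returned by splitlines() can be dropped, since they may be partial lines.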

lines = chunk.splitlines()
# With multiple chunks, the first and last entries may be partial lines, so skip them.
validation_list = lines[1:-1] if the_file.multiple_chunks() else lines

This way, only a few of the hundreds of thousands of lines are skipped in this preliminary check, so the loss ratio stays very low. This is better than risking a false alarm by validating a line that was stitched together from fragments spread across iterations and may not be identical to the original.
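The approach described above could be wrapped into a small helper along these lines. This is a minimal sketch under assumptions: find_long_line and MAX_LINE_LENGTH are hypothetical names, the limit of 20 is arbitrary, and the function takes any iterable of bytes so it can be tried without Django. Note that len() on bytes counts bytes, not decoded characters.

```python
MAX_LINE_LENGTH = 20  # hypothetical limit, not taken from the original post


def find_long_line(chunks, multiple_chunks, max_length=MAX_LINE_LENGTH):
    """Return the first line longer than max_length bytes, or None.

    When the data spans several chunks, the first and last entry of each
    chunk are skipped because they may be fragments of a longer line.
    """
    for chunk in chunks:
        lines = chunk.splitlines()
        if multiple_chunks:
            lines = lines[1:-1]
        for line in lines:
            if len(line) > max_length:
                return line
    return None


chunks = [
    b"fragment of an earlier line\nok\nshort\nsangu",
    b"ine\nthis line is definitely too long\nend",
]
print(find_long_line(chunks, multiple_chunks=True))
# b'this line is definitely too long'
```

Inside clean_upload_file this would presumably be called as find_long_line(the_file.chunks(), the_file.multiple_chunks()), raising forms.ValidationError when the result is not None.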