During the upload of a file I need to split its contents into lines, count the characters of each line, and raise an error if any line exceeds a certain length.
from django import forms
from django.db import models


class TheModel(models.Model):
    upload_file = models.FileField(
        upload_to='the/path'
    )


class TheForm(forms.ModelForm):
    def clean_upload_file(self):
        the_file = self.cleaned_data.get('upload_file')
        if the_file:
            # Iterate over the_file (not the undefined name upload_file).
            for chunk in the_file.chunks():  # the file is huge
                import ipdb; ipdb.set_trace()
The the_file object is opened in rb+ mode. The current chunk of its contents is:
>>> print(chunk)
b' counterproductive\nbishop\nsa raindrop\nsangu'
>>> print(the_file.mode)
'rb+'
Evidently, the end of this chunk (b'sangu') is the beginning of a new line that continues in the next iteration.
>>> print(chunk.splitlines())
[b' counterproductive', b'bishop', b'sa raindrop', b'sangu']
The output of splitlines() does not tell whether the last entry is a complete line or not. Moreover, \n is not guaranteed to be the line separator of every file uploaded in binary mode.
If the newline character varies (it may be \n or \r\n, for example), how can I distinguish whether the last entry of the list represents the end of a line or just the first part of a new one?
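One detail that may be relevant here: bytes.splitlines() accepts keepends=True, in which case every entry keeps its terminator, so a last entry without one can be recognised as partial. The following is only a minimal sketch of that idea; split_chunk is a hypothetical helper name, and it assumes the only terminators in play are \n, \r and \r\n, which are the ones bytes.splitlines() recognises.

```python
def split_chunk(chunk: bytes):
    """Return (complete_lines, remainder) for a single chunk of bytes.

    Hypothetical helper: complete_lines still carry their terminators,
    remainder is the trailing piece that may continue in the next chunk.
    """
    pieces = chunk.splitlines(keepends=True)
    remainder = b''
    # A last piece that does not end in b'\n' is held back. This covers a
    # piece with no terminator at all, and also a trailing b'\r', because
    # the next chunk may start with the b'\n' of a b'\r\n' pair.
    if pieces and not pieces[-1].endswith(b'\n'):
        remainder = pieces.pop()
    return pieces, remainder
```

With the chunk shown above, the remainder would be b'sangu', while the three entries before it are reported as complete.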
UploadedFile.multiple_chunks() makes it possible to drop the first and the last entry of the list returned by splitlines() whenever the upload is large enough to be read in several chunks, which by default means larger than 2.5 MB:
lines = chunk.splitlines()
validation_list = (
    lines[1:-1]  # drop the possibly partial first and last entries
    if the_file.multiple_chunks()
    else lines
)
This way, only a few of the hundreds of thousands of lines are skipped in this preliminary check, which keeps the loss ratio very low. That seems better than risking a false alarm by validating a line that was reconstructed from pieces spread across iterations and may therefore not be identical to the original.
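For comparison, the reconstruction alternative could look roughly like the sketch below, which carries the unfinished tail of each chunk over to the next one. The names iter_lines and first_overlong_line are hypothetical, not part of Django. The sketch assumes that \n, \r and \r\n are the only terminators, and it measures length in bytes, which equals the character count only for single-byte encodings.

```python
def iter_lines(chunks):
    """Yield complete lines, without terminators, from an iterable of byte chunks."""
    remainder = b''
    for chunk in chunks:
        pieces = (remainder + chunk).splitlines(keepends=True)
        remainder = b''
        # Hold back a last piece that is unterminated or ends in a lone
        # b'\r', since the matching b'\n' may arrive with the next chunk.
        if pieces and not pieces[-1].endswith(b'\n'):
            remainder = pieces.pop()
        for piece in pieces:
            yield piece.rstrip(b'\r\n')
    if remainder:
        # Whatever is left after the final chunk is the last line.
        yield remainder.rstrip(b'\r\n')


def first_overlong_line(chunks, max_length):
    """Return the 1-based number of the first line longer than max_length bytes, or None."""
    for number, line in enumerate(iter_lines(chunks), start=1):
        if len(line) > max_length:
            return number
    return None
```

Inside clean_upload_file this could presumably be fed with the_file.chunks(), raising forms.ValidationError when a line number comes back. Note that the held-back remainder grows with the length of the current line, so a file without any line breaks would be buffered in full.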