Search code examples
pythondjangodjango-viewsbulk

Django : bulk upload with confirmation


Yet another question about the style and the good practices. The code, that I will show, works and do the functionality. But I'd like to know is it ok as solution or may be it's just too ugly?

As the question is a little bit obscure, I will give some points at the end.

So, the use case.

I have a site with the items. There is a functionality to add the item by user. Now I'd like a functionality to add several items via a csv-file.

How should it works?

  1. User go to special upload page.
  2. User choose a csv-file, click upload.
  3. Then he is redirected to the page that show the content of csv-file (as a table).
  4. If it's ok for user, he clicks "yes" (button with "confirm_items_upload" value) and the items from file are added to database (if they are ok).

I saw already examples for bulk upload for django, and they seem pretty clear. But I don't find an example with an intermediary "verify-confirm" page. So how I did it :

  1. in views.py : view for upload csv-file page
def upload_item_csv_file(request):
    if request.method == 'POST':
        form = UploadItemCsvFileForm(request.POST, request.FILES)
        if form.is_valid():
            uploaded_file_name = handle_uploaded_item_csv_file(request.FILES['item_csv_file'])
            request.session['uploaded_file'] = uploaded_file_name
            return redirect('show_upload_csv_item')
    else:
        form = UploadItemCsvFileForm()
    return render(request, 'myapp/item_csv_upload.html', {'form': form})
  1. in utils.py : handle_uploaded_item_csv_file - just save the file and return a file-name
def handle_uploaded_item_csv_file(f):
    now = datetime.now()
    # YY_mm_dd_HH_MM
    dt_string = now.strftime("%Y_%m_%d_%H_%M")
    file_name = os.path.join(settings.MEDIA_ROOT, f"tmp_csv/item_csv_{dt_string}.csv")
    with open(file_name, 'wb+') as destination:
        for chunk in f.chunks():
            destination.write(chunk)

    return f"tmp_csv/item_csv_{dt_string}.csv"
  1. in views.py : view for show_upload_csv_item
@transaction.atomic
def show_uploaded_file(request):
    if 'uploaded_file' in request.session :
        file_name = request.session['uploaded_file']
    else :
        print("Something wrong : raise 404")
        raise Http404
    if not os.path.isfile(os.path.join(settings.MEDIA_ROOT, file_name)):
        print("Something wrong, file does not exist : raise 404")
        raise Http404

    with open(os.path.join(settings.MEDIA_ROOT, file_name)) as csvfile :
        fieldnames = ['serial_number', 'type', 'shipping_date', 'comments']
        csv_reader = csv.DictReader(csvfile, delimiter=';', fieldnames=fieldnames)
        list_items = list(csv_reader)

    if request.POST and ("confirm_items_upload" in request.POST) :
        if request.POST["confirm_items_upload"] == "yes" :
            for cur_item in list_items :
                if not cur_item['shipping_date'] :
                    cur_item.pop('shipping_date', None)

                try :
                    Item.objects.create(**cur_item)
                except IntegrityError :
                    messages.warning(request, f"This Item : {cur_item} - already exists. No items were added." )
            os.remove(os.path.join(settings.MEDIA_ROOT, file_name))
            return redirect('items')
    else :
        return render(request, 'myapp/item_csv_uploaded.html', {'items': list_items})
  1. In forms.py : the form is very obvious, but just to be clear
class UploadItemCsvFileForm(forms.Form):
    item_csv_file = forms.FileField()

Here are the questions/points.

a) Even if obviously it could be better, is this solution is acceptable or not at all ?

b) I pass 'uploaded_file' from one view to another using "request.session" is it a good practice? Is there another way to do it without using GET variables?

c) At first my wish was to avoid to save the csv-file. But I could not figure out how to do it? Reading all the file to request.session seems not a good idea for me. Is there some possibility to upload the file into memory in Django?

d) If I have to use the tmp-file. How should I handle the situation if user abandon upload at the middle (for example, he sees the confirmation page, but does not click "yes" and decide to re-write his file). How to remove the tmp-file?

e) Small additional question : what kind of checks there are in Django about uploaded file? For example, how could I check that the file is at least a text-file? Should I do it?

All others remarks are welcome as well.


Solution

  • a) Even if obviously it could be better, is this solution is acceptable or not at all ?

    I think it has some problems you want to address, but the general idea of using the filesystem and storing just filenames can be acceptable, depending on how many users you need to serve and what guarantees regarding data consistency and concurrent accesses you want to make.

    I would consider the uploaded file temporary data that may be lost on system failure. If you want to provide any guarantees of not losing the data, you want to store it in a database instead of on the filesystem.

    b) I pass 'uploaded_file' from one view to another using "request.session" is it a good practice? Is there another way to do it without using GET variables?

    There are up- and downsides to using request.session.

    • attackers can not change the filename and thus retrieve data of other users. This is also the reason why you should not use a GET parameter here: If you used one, attackers could simpy change that parameter and get access to files of other users.
    • users can upload a file, go and do other stuff, and later come back to actually import the file, however:
    • if users end their session, you lose the filename. Also, users can not upload the file on one device, change to another device, and then go on with the import, since the other device will have a different session.

    The last point correlates with the leftover files problem: If you lose your information about which files are still needed, it makes cleaning up harder (although, in theory, you can retrieve which files are still needed from the session store).

    If it is a problem that sessions might end or change because users clear their cookies or change devices, you could consider adding the filename to the UserProfile in the database. This way, it is not bound to sessions.

    c) At first my wish was to avoid to save the csv-file. But I could not figure out how to do it? Reading all the file to request.session seems not a good idea for me. Is there some possibility to upload the file into memory in Django?

    You want to store state. The go-to ways of storing state are the database or a session store. You could load the whole CSVFile and put it into the database as text. Whether this is acceptable depends on your databases ability to handle large, unstructured data. Traditional databases were not originally built for that, however, most of them can handle small binary files pretty well nowadays. A database could give you advantages like ACID guarantees where concurrent writes to the same file on the file system will likely break the file. See this discussion on the dba stackexchange

    Your database likely has documentation on the topic, e.g. there is this page about binary data in postgres.

    d) If I have to use the tmp-file. How should I handle the situation if user abandon upload at the middle (for example, he sees the confirmation page, but does not click "yes" and decide to re-write his file). How to remove the tmp-file?

    Some ideas:

    • Limit the count of uploaded files per user to one by design. Currently, your filename is based on a timestamp. This breaks if two users simultaneously decide to upload a file: They will both get the same timestamp, and the file on disk may be corrupted. If you instead use the user's primary key, this guarantees that you have at most one file per user. If they later upload another file, their old file will be overwritten. If your user count is small enough that you can store one leftover file per user, you don't need additional cleaning. However, if the same user simultaneusly uploads two files, this still breaks.
    • Use a unique identifier, like a UUID, and delete the old stored file whenever the user uploads a new file. This requires you to still have the old filename, so session storage can not be used with this. You will still always have the last file of the user in the filesystem.
    • Use a unique identifier for the filename and set some arbitrary maximum storage duration. Set up a cronjob or similar that regularly goes through the files and deletes all files that have been stored longer than your specified maximum duration. If a user uploads a file, but does not do the actual import soon enough, their data is deleted, and they would have to do the upload again. Here, your code has to handle the case that the file with the stored filename does not exist anymore (and may even be deleted while you are reading the file).

    You probably want to limit your server to one file stored per user so that attackers can not fill your filesystem.

    e) Small additional question : what kind of checks there are in Django about uploaded file? For example, how could I check that the file is at least a text-file? Should I do it?

    You definitely want to set up some maximum file size for the file, as described e.g. here. You could limit the allowed file extensions, but that would only be a usability thing. Attackers could also give you garbage data with any accepted extension.

    Keep in mind: If you only store the csv as text data that you load and parse everytime a certain view is accessed, this can be an easy way for attackers to exhaust your servers, giving them an easy DoS attack.


    Overall, it depends on what guarantees you want to make, how many users you have and how trustworthy they are. If users might be malicious, you want to keep all possible kinds of data extraction and resource exhaustion attacks in mind. The filesystem will not scale out (at least not as easily as a database).

    I know of a similar setup in a project where only a handful of priviliged users are allowed to upload stuff, and we can tolerate deletion of all temporary files on failure. Users will simply have to reupload their files. This works fine.