Search code examples
pythondjangoexcellibreoffice

Django upload and process multiple files failing with libreoffice


I'm working on a Django application that works with excel files. It only works with xlsx files but if you upload an xls or an ods file I convert it previously to xlsx in order to work with that processed file. My application supports multiple file upload in the form. All files uploaded are uploaded successfully and saved into a model in Database with a field status = 'Created'. A post-save model function triggers a new Thread that process files for processing those files in background. After files are processed them are saved as status = 'Error' or status = 'Processed'. I have also added an extra option to reprocess files.

The problem comes when I try to upload multiple files which are not xlsx those files need to be converted to xlsx before my own processing stuff. For that purpose I'm using libreoffice --convert-to xlsx filename --headless in a python subprocess. This is working fine with one or two file upload at the same time. But if I upload multiple files at the same time, some are failing and some files are being processed successfully, and there aren't any pattern with the files. All files for testing works properly if I upload them one by one, or even if I reprocess those files.

The error is given by libreoffice, because if I upload multiple files which are already xlsx files are being processed successfully too. When this happens, libreoffice returns 1 an no stdout nor stderr.

models.py

class Document(models.Model):
    docfile = models.FileField(upload_to='documents/%Y/%m/%d')
    date_creation = models.DateTimeField(auto_now_add=True)
    document_type = models.TextField(max_length=256)
    status = models.TextField(max_length=256, default="Created")
    bank_account = models.TextField(max_length=256, null=True)

    def filename(self):
        return os.path.basename(self.docfile.name)


@receiver(models.signals.post_save, sender=Document)
def process_file(sender, instance, **kwargs):
    t = threading.Thread(target=process_file_function,args=[sender,instance],daemon=True)
    t.start()

functions.py

def process_file_function(sender, instance, **kwargs):
    from accounting.models import Asiento, Apunte, FiltroBanco
    import pytz

    if instance.status == "Created" or instance.status == "Reprocess":
        filename = file = instance.docfile.name
        instance.status='Processing'
        instance.save(update_fields=['status'])

        print(f"Starting processing file: {file}")

        try:
            if filename.endswith('.ods') or filename.endswith('xls'):
                import os
                print(os.stat(filename))
                output = subprocess.run(["libreoffice", "--convert-to", "xlsx", filename, "--headless", "--outdir", "/tmp/sage/"], capture_output=True)
                print(output)
                filename = f"/tmp/sage/{filename.split('/')[-1].replace('xls', 'xlsx').replace('.ods', '.xlsx')}"

            wb = load_workbook(filename=filename, data_only=True)

            # Do my stuff

            instance.status='Processed'
            instance.save()
            print(f"Finished processing file: {file}")
        except Exception as e:
            instance.status='Error'
            instance.save()

Otuput example of a successful file:

Starting processing file: documents/2023/02/19/filename02.ods
os.stat_result(st_mode=33188, st_ino=901900, st_dev=40, st_nlink=1, st_uid=1000, st_gid=1000, st_size=29771, st_atime=1676805630, st_mtime=1676805630, st_ctime=1676805630)
CompletedProcess(args=['libreoffice', '--convert-to', 'xlsx', 'documents/2023/02/19/filename02.ods', '--headless', '--outdir', '/tmp/sage/'], returncode=0, stdout=b'convert /home/ajulian/Documents/code/python/facturasweb/documents/2023/02/19/filename02.ods -> /tmp/sage/filename02.xlsx using filter : Calc Office Open XML\n', stderr=b'')
Finished processing file: documents/2023/02/19/filename02.ods

Output example of a error file:

Starting processing file: documents/2023/02/19/filename01.ods
os.stat_result(st_mode=33188, st_ino=901899, st_dev=40, st_nlink=1, st_uid=1000, st_gid=1000, st_size=21469, st_atime=1676805630, st_mtime=1676805630, st_ctime=1676805630)
CompletedProcess(args=['libreoffice', '--convert-to', 'xlsx', 'documents/2023/02/19/filename01.ods', '--headless', '--outdir', '/tmp/sage/'], returncode=1, stdout=b'', stderr=b'')
ERROR: Error processing file: documents/2023/02/19/filename01.ods
-----
Traceback (most recent call last):
  File "/home/ajulian/Documents/code/python/facturasweb/accounting/functions.py", line 50, in process_file_function
    wb = load_workbook(filename=filename, data_only=True)
  File "/usr/local/lib/python3.9/dist-packages/openpyxl/reader/excel.py", line 315, in load_workbook
    reader = ExcelReader(filename, read_only, keep_vba,
  File "/usr/local/lib/python3.9/dist-packages/openpyxl/reader/excel.py", line 124, in __init__
    self.archive = _validate_archive(fn)
  File "/usr/local/lib/python3.9/dist-packages/openpyxl/reader/excel.py", line 96, in _validate_archive
    archive = ZipFile(filename, 'r')
  File "/usr/lib/python3.9/zipfile.py", line 1239, in __init__
    self.fp = io.open(file, filemode)
FileNotFoundError: [Errno 2] No such file or directory: '/tmp/sage/filename01.xlsx'

Notice the difference between the output of libreoffice subprocess. Don't blame filename01.ods because this file in other executions was converted successfully. It only happens on multiple file upload and not to all files.

What could be the problem? Why happens this sometimes and sometimes not? Why libreoffice only returns 1 without any output?

Thanks in advance.


Solution

  • Solved this issue. The problem happens when libreoffice tries to open same user configuration at the same time. Solved by creating a new userspace for each file: "-env:UserInstallation=file://{tmpfile}"

    tmpfile = f"/tmp/sage/sessions/{filename.split('/')[-1].split('.')[0]}"
    subprocess.run(["libreoffice", "--convert-to", "xlsx", filename, "--headless", "--outdir", "/tmp/sage/", f"-env:UserInstallation=file://{tmpfile}"], capture_output=True)