Tags: python, pdf, docx, pythonanywhere, python-docx

Converting docx to pdf with pure python (on linux, without libreoffice)


I'm developing a web app, part of which converts uploaded docx files to PDF (after some processing). With python-docx and other methods, most of the processing does not require a Windows machine with Word installed, or even LibreOffice on Linux (my web server is PythonAnywhere: Linux, but without LibreOffice and without sudo or apt install permissions). Converting to PDF, however, seems to require one of those. From exploring questions here and elsewhere, this is what I have so far:

import os
import subprocess

try:
    from comtypes import client
except ImportError:
    client = None

def doc2pdf(doc):
    """
    convert a doc/docx document to pdf format
    :param doc: path to document
    """
    doc = os.path.abspath(doc) # bugfix - searching files in windows/system32
    if client is None:
        return doc2pdf_linux(doc)
    name, ext = os.path.splitext(doc)
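    word = client.CreateObject('Word.Application')
    try:
        worddoc = word.Documents.Open(doc)
        try:
            worddoc.SaveAs(name + '.pdf', FileFormat=17)  # 17 = wdFormatPDF
        finally:
            worddoc.Close()  # only reached if the document was opened
    finally:
        word.Quit()  # always shut Word down once it has been started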


def doc2pdf_linux(doc):
    """
    convert a doc/docx document to pdf format (linux only, requires libreoffice)
    :param doc: path to document
    """
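    cmd = 'libreoffice --convert-to pdf'.split() + [doc]
    p = subprocess.Popen(cmd, stderr=subprocess.PIPE, stdout=subprocess.PIPE)
    try:
        # communicate() drains both pipes while waiting, so a full pipe
        # cannot block the child the way wait() with PIPE can
        stdout, stderr = p.communicate(timeout=10)
    except subprocess.TimeoutExpired:
        p.kill()         # do not leave a stuck converter running
        p.communicate()  # reap the killed process
        raise
    if stderr:
        raise subprocess.SubprocessError(stderr)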

As you can see, one method requires comtypes (and therefore Windows with Word installed), and the other requires LibreOffice as a subprocess. Other than switching to a more sophisticated hosting server, is there any solution?


Solution

  • The PythonAnywhere help pages offer information on working with PDF files here: https://help.pythonanywhere.com/pages/PDF

    Summary: PythonAnywhere has a number of Python packages for PDF manipulation installed, and one of them may do what you want. However, shelling out to abiword seems easiest to me. The shell command abiword --to=pdf filetoconvert.docx will convert the docx file to a PDF and produce a file named filetoconvert.pdf in the same directory as the docx. Note that this command will output an error message to the standard error stream complaining about XDG_RUNTIME_DIR (or at least it did for me), but it still works, and the error message can be ignored.
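    For use inside the web app, the abiword command above can be wrapped in a small Python function. This is only a sketch, assuming abiword is on the PATH and writes the PDF next to the input file as described above. The names build_abiword_command and docx2pdf_abiword and the 60-second timeout are made up for illustration. Because abiword may print the XDG_RUNTIME_DIR message even on success, the sketch does not treat stderr output as a failure and instead checks that the PDF exists.

```python
import os
import subprocess


def build_abiword_command(docx_path):
    """Return the argument list for the abiword call described above."""
    return ['abiword', '--to=pdf', os.path.abspath(docx_path)]


def docx2pdf_abiword(docx_path, timeout=60):
    """
    convert a docx document to pdf by shelling out to abiword
    :param docx_path: path to document
    :param timeout: seconds to wait before giving up (arbitrary default)
    :return: path of the pdf that abiword is expected to write
    """
    docx_path = os.path.abspath(docx_path)
    pdf_path = os.path.splitext(docx_path)[0] + '.pdf'
    # Capture stderr but do not raise on it: the XDG_RUNTIME_DIR
    # complaint is reported to be harmless.
    result = subprocess.run(build_abiword_command(docx_path),
                            stdout=subprocess.PIPE,
                            stderr=subprocess.PIPE,
                            timeout=timeout)
    # Judge success by the presence of the output file, not by stderr.
    if not os.path.isfile(pdf_path):
        raise RuntimeError('abiword did not produce %s: %r'
                           % (pdf_path, result.stderr))
    return pdf_path
```

    Calling docx2pdf_abiword('filetoconvert.docx') should return the absolute path of filetoconvert.pdf. Note that a PDF left over from an earlier run would also pass the existence check, so you may want to delete or rename old output before converting.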