
Universal PDF converter


I am looking for help with an "any document converter" that converts any document file (doc, docx, ppt, pptx) to PDF. DOCX and PPTX are easy to handle with Python libraries, but DOC and PPT are a bit tricky.
The answers I got 7 months ago were rather hard to work with, especially the one that used Unoconv (now deprecated and replaced by Unoserver).

Initial code example:

import os
import shutil

src = ".../srcpaths"
dst = ".../dstpaths"
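# os.path.splitext returns the extension with its leading dot, e.g. '.ppt'
ext = ['.ppt', '.pptx', '.doc', '.docx']

for root, subfolders, filenames in os.walk(src):
    for filename in filenames:
        # lower() so that names like 'REPORT.DOCX' also match
        if os.path.splitext(filename)[1].lower() in ext:
            shutil.copy2(os.path.join(root, filename), os.path.join(dst, filename))

def ConvertToPDF(ext):
    pass  # some code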

ConvertToPDF('.ppt')
ConvertToPDF('.pptx')
ConvertToPDF('.doc')
ConvertToPDF('.docx')

Solution

  • Below is my review of the solutions I tried, with my answer at the end:

    1. Pandoc:

    • requires a LaTeX PDF engine
    • does not preserve the layout of files well
    • loss of formatting
    • problems with graphics
    • problems with charts
    • problems with fonts
    • limited choice of formats

    2. Unoconv/Unoserver

    • hard to install and work with
    • requires LibreOffice as the engine
    • good conversion results (not perfect)

    3. Cloud-based solutions:

    • not free
    • not open-source friendly
    • privacy concerns

    4. Google Drive API converter:

    • requires someone’s Google account
    • workflow: upload the document, convert it, save it as PDF
    • privacy concerns

    5. LibreLambda

    • uses Amazon Web Services (AWS)
    • privacy concerns

    Simple solution:

    Use LibreOffice directly by running it from the command line in a subprocess.

    Requirement: LibreOffice must be installed. Biggest advantage: it can run on both Windows and Linux (the code below needs to be modified for Linux).

    Here is my Python code for Windows:

    import os
    import subprocess
    
    # path to the engine
    path_to_office = r"C:\Program Files\LibreOffice\program\soffice.exe"
    
    # path with files to convert
    source_folder = r"C:\ConvertToPDF\input_files"
    
    # path with pdf files
    output_folder = r"C:\ConvertToPDF\output_files"
    
    # changing directory to source so that *.* refers to the input files
    os.chdir(source_folder)
    
    # build and run the LibreOffice conversion command
    command = f"\"{path_to_office}\" --convert-to pdf --outdir \"{output_folder}\" *.*"
    # check=True raises CalledProcessError if LibreOffice exits with an error
    subprocess.run(command, check=True)
    
    print('Converted')
    

    If you can adapt it for Linux, please feel free to share your solution.
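
One possible way to adapt the script above for Linux is sketched below. It is only a sketch under a few assumptions, not a tested drop-in replacement. It assumes the LibreOffice launcher is on PATH as `soffice` or `libreoffice` (typical for distribution packages, but not guaranteed) and otherwise falls back to the Windows path from the code above. The names `find_soffice`, `collect_files`, `build_command` and `convert_folder` are made up for this example. Instead of relying on `*.*`, it passes each file explicitly as a list argument, so no shell quoting or wildcard expansion is involved.

```python
import shutil
import subprocess
from pathlib import Path

# Extensions to convert, taken from the question
EXTENSIONS = {".doc", ".docx", ".ppt", ".pptx"}


def find_soffice():
    """Return a path to the LibreOffice executable, or None if not found."""
    # Assumption: on Linux the launcher is on PATH under one of these names
    for name in ("soffice", "libreoffice"):
        found = shutil.which(name)
        if found:
            return found
    # Fallback: the Windows install path used in the script above
    default = Path(r"C:\Program Files\LibreOffice\program\soffice.exe")
    return str(default) if default.is_file() else None


def collect_files(source_folder):
    """List files directly inside source_folder with a supported extension."""
    return sorted(
        p for p in Path(source_folder).iterdir()
        if p.is_file() and p.suffix.lower() in EXTENSIONS
    )


def build_command(soffice, files, output_folder):
    """Build the argument list for one LibreOffice conversion run."""
    return [soffice, "--headless", "--convert-to", "pdf",
            "--outdir", str(output_folder), *map(str, files)]


def convert_folder(source_folder, output_folder):
    """Convert all supported files in source_folder; return how many were passed."""
    soffice = find_soffice()
    if soffice is None:
        raise FileNotFoundError("LibreOffice executable not found")
    files = collect_files(source_folder)
    if files:
        # check=True raises CalledProcessError if LibreOffice exits with an error
        subprocess.run(build_command(soffice, files, output_folder), check=True)
    return len(files)
```

Usage would be something like `convert_folder("input_files", "output_files")`, where both folder names are placeholders. Filtering by extension also keeps unrelated files in the input folder from being handed to LibreOffice.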