python-2.7 loops pdf-conversion comtypes

Looping over the .doc files to convert them to .pdf (Python)


I am looking for a way to convert .doc files to .pdf in Python 2.7.x. Handling .doc files in Python seems far less straightforward than handling .docx and .pdf. The most suitable working solution I have found so far is the comtypes approach shown below, but when I extend it to loop over the .doc files in a given directory I get this error:

_ctypes.COMError: (-2146823114, None, (u"Sorry, we couldn't find your file.     Was it moved, renamed, or deleted?\r (C:\\windows\\system32\\PrivateCourse_AR.doc)", u'Microsoft Word', u'wdmain11.chm', 24654, None))

Here is the code:

import os
import comtypes.client

os.chdir('C:\Users\Domi\PycharmProjects\STStransl-auto\doc')
path = os.getcwd()
print path

input = os.listdir(path)
print input
print len(input)

wdFormatPDF = 17 #pdf

i=0

output = '.\doc2txt_{}'.format(i)

word = comtypes.client.CreateObject('Word.Application')
for file in input:
    if file.endswith('.doc'):
        print file
        doc = word.Documents.Open(file)
        doc.SaveAs(output, FileFormat=wdFormatPDF)
        i += 1
        doc.Close()
        word.Quit()

Any advice on the code, or on how to handle .doc files efficiently in Python, is welcome and much appreciated. I am working on an automation script that handles .docx and .pdf files (merging, extracting text, and splitting text into multiple files). Those formats cause no problems, but unfortunately I also have a lot of .doc files. Thanks a lot.


Solution

  • Note that the error reports your file name, but under a system directory:

    C:\\windows\\system32\\PrivateCourse_AR.doc

    That happens because you are not launching Word as a subprocess of your script. comtypes talks to MS Word through COM, and the Word process runs with its own current directory (here C:\windows\system32). A relative file path is therefore resolved against Word's directory, not your script's, so the file is not found. Fortunately, Word has the courtesy to report the absolute path it tried.

    To fix that, just do:

    word.Documents.Open(os.path.abspath(file))
    

    This resolves the path against your script's current directory (which is the correct one) before the path is handed to Word.

    The same issue, and the same fix, probably applies to the save step:

    doc.SaveAs(os.path.abspath(output), FileFormat=wdFormatPDF)
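
The two fixes can be combined into a loop along the following lines. This is only a sketch under some assumptions: the helper names `pdf_jobs` and `convert_all` are made up for illustration, and each PDF is named after its source file rather than with the question's `doc2txt_{}` pattern. It also sidesteps two things in the question's loop that would likely cause trouble next: `output` is formatted once before the loop, so every document would be saved under the same name, and `word.Quit()` sits inside the loop, so Word is closed after the first file.

```python
import os

WD_FORMAT_PDF = 17  # same value as wdFormatPDF in the question


def pdf_jobs(src_dir):
    """Return (absolute .doc path, absolute .pdf path) pairs for src_dir."""
    src_dir = os.path.abspath(src_dir)
    jobs = []
    for name in sorted(os.listdir(src_dir)):
        if name.lower().endswith('.doc'):  # skips .docx and everything else
            base = os.path.splitext(name)[0]
            # Assumption: write each PDF next to its source, same base name.
            jobs.append((os.path.join(src_dir, name),
                         os.path.join(src_dir, base + '.pdf')))
    return jobs


def convert_all(word, src_dir):
    """Convert every .doc in src_dir with an already created Word object."""
    for doc_path, pdf_path in pdf_jobs(src_dir):
        # Word only ever receives absolute paths, so its own current
        # directory no longer matters.
        doc = word.Documents.Open(doc_path)
        try:
            doc.SaveAs(pdf_path, FileFormat=WD_FORMAT_PDF)
        finally:
            doc.Close()  # close the document even if saving fails
```

On a Windows machine with Word installed, you would create the application object once with `comtypes.client.CreateObject('Word.Application')` as in the question, pass it to `convert_all` together with the directory, and call `word.Quit()` a single time after the loop has finished.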
    

    Aside: always use the raw prefix for Windows file paths. Without it, a path such as "C:\temp" can surprise you, because \t is read as a tab character. Write r"C:\temp" instead.
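
A quick way to see the difference, using the same example path (the variable names are only for illustration):

```python
plain = 'C:\temp'   # "\t" is an escape sequence, so this string holds a tab
raw = r'C:\temp'    # raw string: the backslash and the "t" are kept as typed

print(repr(plain))  # 'C:\temp'   (repr shows the tab as \t: one character)
print(repr(raw))    # 'C:\\temp'  (a real backslash followed by "t")
```

The path in the question happens to work without the raw prefix only because none of its backslash pairs form a recognised escape in a Python 2 byte string; that is easy to break by renaming a folder.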