
Reading a .doc file with ElementTree


I have successfully read .docx files by opening them with zipfile and parsing the contents with ElementTree. But I realized that .doc files do not contain the archive member 'word/document.xml'. I looked through the docs but did not find anything about this. How can a .doc file be read? For .docx, I used:

import zipfile as zf
import xml.etree.ElementTree as ET
z = zf.ZipFile("test.docx")
doc_xml = z.open('word/document.xml')
tree = ET.parse(doc_xml)

Using the above on a .doc file gives:

KeyError: "There is no item named 'word/document.xml' in the archive"

I saw a way to read files in the ElementTree docs, but it only works for XML files.

doc_xml = open('yesblue.doc','r')  

How should I go about this? Maybe by converting the .doc to .docx in Python itself.

Edit: The .doc format stores its data in binary, so an XML parser cannot read it directly.
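To confirm which kind of file you actually have, you can look at its contents rather than its extension. This is a minimal sketch: the function name `sniff_word_format` is made up for illustration, and it only checks for the OLE2 compound-file signature used by legacy Office files and for a zip archive containing 'word/document.xml'.

```python
import zipfile

# Signature at the start of OLE2 compound files (legacy .doc, .xls, .ppt).
OLE2_MAGIC = b"\xd0\xcf\x11\xe0\xa1\xb1\x1a\xe1"


def sniff_word_format(path):
    """Return 'doc', 'docx', or 'unknown' based on the file's contents.

    Hypothetical helper for illustration; not part of any library.
    """
    with open(path, "rb") as f:
        header = f.read(len(OLE2_MAGIC))
    if header == OLE2_MAGIC:
        # Binary container; there is no XML to hand to ElementTree.
        return "doc"
    if zipfile.is_zipfile(path):
        with zipfile.ZipFile(path) as z:
            if "word/document.xml" in z.namelist():
                return "docx"
    return "unknown"
```

A result of 'doc' means the file has to be converted (or read with a tool that understands the binary format) before ElementTree is of any use.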


Solution

  • After some serious searching, I realized that it would be better to use the comtypes package to convert the file from .doc to .docx format. This has its own disadvantages: it only works on Windows and needs Microsoft Office to be installed.

    import os
    import comtypes.client

    # Word resolves relative paths against its own working directory,
    # so pass absolute paths for both files.
    in_file = os.path.abspath('yesblue.doc')    # name of input file
    out_file = os.path.abspath('yesblue.docx')  # output file, written to the current working directory
    word = comtypes.client.CreateObject('Word.Application')
    doc = word.Documents.Open(in_file)
    doc.SaveAs(out_file, FileFormat=16)  # 16 = wdFormatDocumentDefault (.docx)
    doc.Close()
    word.Quit()
    

    The full list of format codes is given by the WdSaveFormat enumeration in the Word documentation.

    The output docx file can be used for further processing in ElementTree.
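For the further processing step, here is a rough sketch of pulling paragraph text out of the converted file with ElementTree. The helper name `docx_paragraphs` is made up for illustration, and it assumes the usual WordprocessingML layout, where paragraphs are `w:p` elements and text runs are `w:t` elements; it ignores tables, headers, footnotes and the like.

```python
import zipfile
import xml.etree.ElementTree as ET

# Namespace used by the main document part of a .docx file.
W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"


def docx_paragraphs(path):
    """Return the text of each w:p paragraph in word/document.xml.

    Hypothetical helper for illustration; not part of any library.
    """
    with zipfile.ZipFile(path) as z:
        with z.open("word/document.xml") as doc_xml:
            root = ET.parse(doc_xml).getroot()
    paragraphs = []
    for p in root.iter("{%s}p" % W_NS):
        # A paragraph's text is split across one or more w:t elements.
        text = "".join(t.text or "" for t in p.iter("{%s}t" % W_NS))
        paragraphs.append(text)
    return paragraphs
```

Calling `docx_paragraphs('yesblue.docx')` on the file produced by the conversion above should give one string per paragraph.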