Search code examples
pythonxlsdocxpywin32xlwt

Python - Convert tables from .doc / .docx-files to .xls


I'm tasked to convert a series of tables from .doc and .docx-files to .xls,

But have not managed to find an efficient way to do this. The tables may be in between other text.

I have looked into pywin32, xlwt and a couple of other libraries, but it seems like I have to go through a lot of steps.

Any tips for this table conversion from *.doc/*.docx to *.xls file?


Solution

  • I'm assuming you have too many documents for copy/paste, and seek a pragmatic solution for internal use. This solution:

    • Opens the file in Word in batch mode
    • You could write a little script to cut everything outside tags from the HTML
    • Saves the file in HTML, but using a .xls extension
    • The HTML file will open in Excel by default and you just click away the warning.

    Create a macro in Word such as this:

    Sub BatchSaveAs()
        ' Set output_dir appropriately
        ChangeFileOpenDirectory "output_dir"
    
        outDocName = Left(ActiveDocument.Name, Len(ActiveDocument.Name) - 4) & ".xls"
    
        ActiveDocument.SaveAs FileName:=outDocName, FileFormat:= _
            wdFormatFilteredHTML, LockComments:=False, Password:="", AddToRecentFiles _
            :=True, WritePassword:="", ReadOnlyRecommended:=False, EmbedTrueTypeFonts _
            :=False, SaveNativePictureFormat:=False, SaveFormsData:=False, _
            SaveAsAOCELetter:=False
    
        ActiveWindow.View.Type = wdWebView
    
        Application.Quit SaveChanges:=wdDoNotSaveChanges
    End Sub
    

    Now you can run Word in batch mode through a script which calls it for each input file:

    winword file_name /mBatchSaveAs
    

    (You may need to use full path names)

    If the warning on opening the HTML / Excel files is not OK, you could write a little Python script to run Excel in batch mode. This shows how to run Excel in from Python:

    Python com between python and excel

    Some tricks I found useful: use finally for your clean-up; the code you need looks like VBA code, and if you're not good at VBA, record a macro to do what you want and modify for Python syntax.