I'm tasked to convert a series of tables from .doc
and .docx-files
to .xls
,
But have not managed to find an efficient way to do this. The tables may be in between other text.
I have looked into pywin32, xlwt
and a couple of other libraries, but it seems like I have to go through a lot of steps.
Any tips for this table conversion from *.doc/*.docx
to *.xls
file?
I'm assuming you have too many documents for copy/paste, and seek a pragmatic solution for internal use. This solution:
Create a macro in Word such as this:
Sub BatchSaveAs()
' Set output_dir appropriately
ChangeFileOpenDirectory "output_dir"
outDocName = Left(ActiveDocument.Name, Len(ActiveDocument.Name) - 4) & ".xls"
ActiveDocument.SaveAs FileName:=outDocName, FileFormat:= _
wdFormatFilteredHTML, LockComments:=False, Password:="", AddToRecentFiles _
:=True, WritePassword:="", ReadOnlyRecommended:=False, EmbedTrueTypeFonts _
:=False, SaveNativePictureFormat:=False, SaveFormsData:=False, _
SaveAsAOCELetter:=False
ActiveWindow.View.Type = wdWebView
Application.Quit SaveChanges:=wdDoNotSaveChanges
End Sub
Now you can run Word in batch mode through a script which calls it for each input file:
winword file_name /mBatchSaveAs
(You may need to use full path names)
If the warning on opening the HTML / Excel files is not OK, you could write a little Python script to run Excel in batch mode. This shows how to run Excel in from Python:
Python com between python and excel
Some tricks I found useful: use finally for your clean-up; the code you need looks like VBA code, and if you're not good at VBA, record a macro to do what you want and modify for Python syntax.