I have quite a large collection of DOCX documents, and I need to delete all but the first page in each of them. From what I have read, python-docx does not support this, since it has no notion of pages. One option I have considered is converting to PDF, deleting the pages, and converting back to DOCX, but I am concerned this will break the formatting somewhat, not to mention that it would probably be slow for so many documents. What is my best option here?
Something like:
for page in pages[1:]:
del page
Okay, so with some help from LibreOffice forum members I have a solution: a macro. It's relatively slow, but it is what it is. Note that this deletes every page after the first, but with some work you can rewrite it to select a particular page or a range of pages.
note: Warning to future readers: Good if this approximation works for you, but you should realize that there's no guarantee that LibreOffice's pagination algorithm will match Microsoft Word's, so users who use Word may see different deletions. As such, you probably don't want to use this in a production pipeline, and for one-offs, you might be better off using Word Automation to get results closer to what most users would see as a "page". Bottom line: Any design dependent upon DOCX "pages" at the document data level alone is intrinsically flawed. – user @kjhughes
Macro:
Sub del(start, end_, subdir)
    Dim doc, cursor, i
    ' Load each document hidden so that no window is opened
    Dim props(0) As New com.sun.star.beans.PropertyValue
    props(0).Name = "Hidden"
    props(0).Value = True
    ' Process doc<start>.docx through doc<end_ - 1>.docx in the given subfolder
    For i = start To end_ - 1
        doc = StarDesktop.LoadComponentFromUrl("file:///path_to_your_document_folder/" & subdir & "/doc" & i & ".docx", "_default", 0, props)
        cursor = doc.CurrentController.getViewCursor()
        cursor.gotoStart(False)
        ' If there is a second page, select from its start to the end of the document and delete
        If cursor.jumpToNextPage() Then
            cursor.gotoEnd(True)
            cursor.setString("")
        End If
        ' Save over the original file, then close it
        doc.store()
        doc.close(True)
    Next i
End Sub
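The macro builds each file name as doc<i>.docx inside the given subfolder, and a missing file would likely stop it at the load step. A minimal Python sketch for checking this beforehand, assuming the same naming scheme; `expected_docs`, `missing_docs`, and the `root` parameter are hypothetical names, not part of the original solution:

```python
from pathlib import Path


def expected_docs(root, subdir, start, end):
    """Return the paths the macro would try to open: doc<start> .. doc<end - 1>."""
    folder = Path(root) / subdir
    return [folder / f"doc{i}.docx" for i in range(start, end)]


def missing_docs(root, subdir, start, end):
    """Return the expected paths that are not present on disk."""
    return [p for p in expected_docs(root, subdir, start, end) if not p.is_file()]
```

Calling `missing_docs(root, subdir, 0, 1000)` before a batch returns the files the loop would fail to find, so the range can be adjusted first.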
The soffice command, run through Python:
import subprocess as sp
import time

subdir = "<subdir_name>"  # placeholder: the subfolder to process
clip_cmd = (
    'soffice --nologo --nofirststartwizard --norestore'
    f' "macro:///Standard.Module1.del(0, 1000, {subdir})"'
)
a = time.time()
print(f"clipping subdir {subdir}.")
sp.call(clip_cmd, shell=True, stdout=sp.DEVNULL)
print(f"This batch took {time.time() - a} seconds.")
Of course, make sure the del macro is saved in your LibreOffice user profile, in Module1 of the Standard library, so that it matches the macro URL in the command.
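For several subfolders, the same call can be wrapped in a small helper. This is only a sketch: `build_clip_command` and `clip_subdirs` are hypothetical names, the macro URL format simply mirrors the command in this answer, and passing an argument list avoids the need for `shell=True` and manual quoting.

```python
import subprocess


def build_clip_command(start, end, subdir, soffice="soffice"):
    """Build the soffice argument list that calls the del macro for one subfolder."""
    # Same macro URL shape as the command above: del(start, end, subdir)
    macro_url = f"macro:///Standard.Module1.del({start}, {end}, {subdir})"
    return [soffice, "--nologo", "--nofirststartwizard", "--norestore", macro_url]


def clip_subdirs(subdirs, start=0, end=1000):
    """Run the macro once per subfolder, discarding soffice's stdout."""
    for subdir in subdirs:
        subprocess.run(
            build_clip_command(start, end, subdir),
            stdout=subprocess.DEVNULL,
            check=False,  # keep going even if one batch exits with an error
        )
```

For example, `clip_subdirs(["batch1", "batch2"])` would process two subfolders in turn, where the folder names are placeholders.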