python ms-word ms-office python-docx

Python-docx: How can I remove the last page from a Word document?


I am trying to remove the last page from a Word document, but haven't found a solution yet. More precisely, I want to remove a section from the document.

document.sections[-1]

can be used to access the last section, but how can I remove it?


Solution

  • The short and unfortunate answer seems to be that you can't do this with python-docx, at least not through its API. If you dug into the underlying XML you could probably hack together something that works for your specific case, but in the 10-15 minutes of research I did, it doesn't appear to be possible directly.

    Here are a few issues:

    1. python-docx has no notion of pages; see "Python-docx: identify a page break in paragraph".
    2. Copying content from one document to another (or, equivalently, creating an empty document and copying content into it) is quite complex and in general is not supported by python-docx. See "combine word document using python docx".

    Though from the posts in (2) it seems there might be an alternative package that could help (https://pypi.org/project/docxcompose/).

    Edit: This is as far as I got. It's quite kludgy but worked with a very quick basic test, though I think it's partially broken. And it left a blank page at the end. This definitely doesn't solve the question, but maybe could be a starting point to dig more.

    import docx

    def get_last_page_break(document):
        # Return the 1-based (paragraph, run) position of the last page break,
        # or (None, None) if the document contains none.
        lastpara_index = lastrun_index = None
        for paragraph_index, paragraph in enumerate(document.paragraphs, start=1):
            for run_index, run in enumerate(paragraph.runs, start=1):
                xml = run._element.xml
                is_soft_break = 'lastRenderedPageBreak' in xml
                is_hard_break = 'w:br' in xml and 'type="page"' in xml
                if is_soft_break or is_hard_break:
                    lastpara_index = paragraph_index
                    lastrun_index = run_index
        return lastpara_index, lastrun_index

    def kludgy_remove_last_page(document):
        new_doc = docx.Document()
        last_para, last_run = get_last_page_break(document)

        # Copy text only (formatting is lost) up to and including the paragraph
        # that holds the last page break. With no break, everything is copied.
        for para_index, para in enumerate(document.paragraphs[:last_para], start=1):
            new_para = new_doc.add_paragraph()
            # Cut runs only in the final paragraph; earlier ones are copied whole.
            runs = para.runs[:last_run] if para_index == last_para else para.runs
            for run in runs:
                new_para.add_run(run.text)
                if 'w:br' in run._element.xml and 'type="page"' in run._element.xml:  # hard page break
                    new_doc.add_page_break()
        return new_doc

    d = docx.Document('test.docx')
    new_doc = kludgy_remove_last_page(d)
    new_doc.save('removed.docx')