Tags: python, python-3.x, document, extract, text-extraction

Extracting text page by page from an MS Word docx file using Python


I have an MS Word docx file and need to extract its text page by page. I tried python-docx, but it extracts the whole text with no page boundaries. I also converted the docx to PDF and extracted the text from that. The problem is that the conversion changed the page structure: for example, the font size changed, so content that fit on one page of the docx took more than one page in the PDF.

I am looking for a stable solution that extracts text page by page from a docx file (doing it without converting to PDF would fit my overall solution better). Can somebody help me with this?


Solution

  • I found that the Tika library can return XML content when reading a file (xmlContent=True). I used that to capture the XML output and then split it on the page markers. The Python code that worked for me is below.

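    from tika import parser

    def extract_pages(file):  # function name is illustrative
        # Ask Tika for XML output instead of plain text
        raw_xml = parser.from_file(file, xmlContent=True)
        body = raw_xml['content'].split('<body>')[1].split('</body>')[0]
        # Strip paragraph and plain div tags; the page markers are kept
        body_without_tag = body.replace("<p>", "").replace("</p>", "").replace("<div>", "").replace("</div>", "").replace("<p />", "")
        text_pages = body_without_tag.split("""<div class="page">""")[1:]
        num_pages = len(text_pages)
        # Check that it worked: the split count must match Tika's page count
        if num_pages == int(raw_xml['metadata']['xmpTPg:NPages']):
            return text_pages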
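The chained replace calls above only remove the exact tag spellings they list, and they leave HTML entities such as &amp; untouched. Below is a minimal sketch of a slightly more tolerant split step, using only the standard library. It assumes the XML string has the same shape as in the answer, with each page wrapped in `<div class="page">`. The function name `split_pages` and the sample string are made up for illustration and are not part of Tika.

```python
import re
from html import unescape
from typing import List

PAGE_DIV = '<div class="page">'  # page marker relied on in the answer above


def split_pages(xhtml: str) -> List[str]:
    """Split an XHTML string into plain-text pages (hypothetical helper)."""
    # Keep only what is inside <body>...</body>, if a body element is present.
    match = re.search(r"<body[^>]*>(.*)</body>", xhtml, flags=re.DOTALL)
    body = match.group(1) if match else xhtml
    # Anything before the first page marker belongs to no page.
    chunks = body.split(PAGE_DIV)[1:]
    pages = []
    for chunk in chunks:
        text = re.sub(r"<[^>]+>", "", chunk)  # drop all remaining tags
        text = unescape(text)                 # turn &amp; back into &
        pages.append(text.strip())
    return pages


# Hand-written sample that imitates the assumed output shape.
sample = (
    "<html><head><title>t</title></head><body>"
    '<div class="page"><p>First page</p>\n<p>A &amp; B</p>\n</div>\n'
    '<div class="page"><p>Second page</p>\n</div>\n'
    "</body></html>"
)

for number, page in enumerate(split_pages(sample), start=1):
    print(number, repr(page))
```

With real input, the string passed in would be raw_xml['content'] from the code above, and comparing len(split_pages(...)) against the xmpTPg:NPages metadata value remains a useful sanity check.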