Tags: python, python-3.x, docx, extract, python-docx

How to extract text data in a table created in a docx document


I would like to extract text from docx documents. I came up with a script that does this, but I noticed that some documents contain tables and the script does not pick up the text inside them. How can I improve the script below?


import glob
import os

import docx

with open('your_file.txt', 'w') as f:
    for directory in glob.glob('fi*'):
        for filename in glob.glob(os.path.join(directory, "*")):
            # python-docx reads only .docx files, not legacy .doc files
            if filename.endswith(".docx"):
                document = docx.Document(filename)
                for paragraph in document.paragraphs:
                    # skip empty paragraphs
                    if paragraph.text:
                        f.write("%s\n" % paragraph.text)


[Image: docx with table]


Solution

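  • Text inside tables is not included in document.paragraphs. With python-docx (the docx module the script already imports), iterate over the document's tables, then over each table's rows and cells.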

    pip install python-docx

    import docx
    
    doc = docx.Document("document.docx")
    
    for table in doc.tables:
        for row in table.rows:
            for cell in row.cells:
                print(cell.text)
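
Building on the loop above, here is a minimal sketch that collects both paragraph text and table text so they can be written to the output file. The helper names (table_rows, document_lines, write_docx_text) and the tab separator between cells are illustrative choices, not part of python-docx. It emits all body paragraphs first and then all tables, so the original interleaving of text and tables is not preserved.

```python
def table_rows(table):
    """Return a table as a list of rows, each a list of stripped cell texts."""
    return [[cell.text.strip() for cell in row.cells] for row in table.rows]


def document_lines(document, cell_separator="\t"):
    """Collect non-empty body paragraphs, then one line per table row."""
    lines = [p.text for p in document.paragraphs if p.text]
    for table in document.tables:
        for row in table_rows(table):
            lines.append(cell_separator.join(row))
    return lines


def write_docx_text(docx_path, out_file):
    """Write the paragraph and table text of one .docx file to out_file."""
    # Imported here so the two helpers above work on any object that
    # exposes .paragraphs and .tables (hypothetical design choice).
    import docx

    for line in document_lines(docx.Document(docx_path)):
        out_file.write(line + "\n")
```

In the question's script, a call such as write_docx_text(filename, f) could replace the inner loop over document.paragraphs.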