How to extract a Word table from multiple files using python docx


I am working on a project at work where I need to analyze over a thousand MS Word files, each containing the same table. From each table I need to extract only a few cells and turn them into a row; the rows will later be concatenated into a DataFrame for further analysis.

I tested the python-docx library on one file and it read the table correctly. However, when I put the same code inside a for loop that iterates over a list of all the file names and passes each name to the Document function, the output is just one table, which is the first table in the list of files.

I have a feeling I'm not looking at this the right way. I would appreciate any guidance on this, as I'm completely helpless now.

Following is the code I used; it consists mainly of code I stumbled upon on Stack Overflow:

import os
import pandas as pd
from docx import Document
file = [f for f in os.listdir() if f.endswith(".docx") ]

for name in file:
    document = Document(name)
    table = document.tables[0]
    data = []

    keys = None
    for i, row in enumerate(table.rows):
        text = (cell.text for cell in row.cells)

        # Establish the mapping based on the first row
        # headers; these will become the keys of our dictionary
        if i == 0:
            keys = tuple(text)
            continue

        # Construct a dictionary for this row, mapping
        # keys to values for this row
        row_data = dict(zip(keys, text))
        data.append(row_data)

Thanks


Solution

  • You are reinitializing the data list to [] (empty) for every document, so you carefully collect the row data from one document and then throw it away as soon as the next document starts.

    If you move data = [] outside the loop, then after iterating through the documents it will contain all the extracted rows.

    data = []
    
    for name in filenames:
        ...
        data.append(row_data)
    
    print(data)
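
For reference, here is a minimal sketch of the whole loop with that fix applied. It assumes, as the question's code does, that the table of interest is the first one in each file and that its first row holds the headers. The helper names (rows_to_records, read_first_table, collect_records) are made up for illustration and are not part of python-docx.

```python
import os


def rows_to_records(rows):
    """Turn table rows (lists of cell strings) into dicts keyed by the header row."""
    records = []
    keys = None
    for i, cells in enumerate(rows):
        if i == 0:
            keys = tuple(cells)  # first row holds the headers
            continue
        records.append(dict(zip(keys, cells)))
    return records


def read_first_table(path):
    """Return the first table of a .docx file as a list of rows of cell text."""
    # Imported here so rows_to_records can be used without python-docx installed.
    from docx import Document

    table = Document(path).tables[0]
    return [[cell.text for cell in row.cells] for row in table.rows]


def collect_records(folder="."):
    """Gather the records from every .docx file in `folder` into one list."""
    data = []  # created once, before the loop, so earlier files are not discarded
    for name in sorted(os.listdir(folder)):
        if name.endswith(".docx"):
            path = os.path.join(folder, name)
            data.extend(rows_to_records(read_first_table(path)))
    return data
```

The list returned by collect_records() can be passed straight to pd.DataFrame(...), which accepts a list of dicts and produces one DataFrame row per table row. If only a few cells are needed, the dicts can be filtered down to the relevant keys before building the DataFrame.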