
Search through Word tables for certain text Python docx


I have some code that reads a table in a Word document and makes a dataframe from it.

import numpy as np
import pandas as pd
from docx import Document

####    Time for some old fashioned user functions    ####
def make_dataframe(f_name, table_loc):
    document = Document(f_name)
    table = document.tables[table_loc]

    data = []  # one dict per body row
    for i, row in enumerate(table.rows):
        text = (cell.text for cell in row.cells)
        if i == 0:
            # the first row holds the column names
            keys = tuple(text)
            continue

        row_data = dict(zip(keys, text))
        data.append(row_data)
    df = pd.DataFrame.from_dict(data)
    return df


SHRD_filename = "SHRD - 12485.docx"
SHDD_filename = "SHDD - 12485.docx"

df_SHRD = make_dataframe(SHRD_filename,30)
df_SHDD = make_dataframe(SHDD_filename,-60)

The files are different: the SHRD has 32 tables and the one I am looking for is the second to last, while the SHDD file has 280 tables and the one I am looking for is 60th from the end. But that may not always be the case.

How do I search through the tables in a document and start working on the one whose first cell, cell (0, 0), contains 'Tag Numbers'?


Solution

  • You can iterate through the tables and check the text in the first cell of each one. The comparison ignores case and surrounding whitespace. I have modified the function to return a list of dataframes, in case more than one table is found. It returns an empty list if no table meets the criteria.

    def make_dataframe(f_name, first_cell_string='tag numbers'):
        document = Document(f_name)
        # normalise the search text the same way as the cell text below
        target = first_cell_string.lower().strip()
    
        # create a list of all of the table objects whose first cell's
        # text (lower-cased and stripped) equals `target`
        tables = [t for t in document.tables 
                  if t.cell(0,0).text.lower().strip()==target]
    
        # in case more than one table is found
        out = []
        for table in tables:
            data = []  # rows of the current table only
            for i, row in enumerate(table.rows):
                text = (cell.text for cell in row.cells)
                if i == 0:
                    # the first row holds the column names
                    keys = tuple(text)
                    continue
    
                row_data = dict(zip(keys, text))
                data.append(row_data)
            out.append(pd.DataFrame.from_dict(data))
        return out
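
If it helps to check the matching logic without opening a .docx file, the search and the row extraction can be separated from the file handling. The following is only a sketch under some assumptions: the helper names `normalize`, `find_tables` and `table_to_records` are hypothetical (they are not part of python-docx), and the functions rely only on each table exposing `.rows`, each row exposing `.cells`, and each cell exposing `.text`, which python-docx table objects do.

```python
def normalize(text):
    """Lower-case the text and collapse runs of whitespace."""
    return " ".join(text.split()).lower()


def find_tables(tables, first_cell_string="tag numbers"):
    """Return the tables whose top-left cell matches first_cell_string.

    `tables` is any iterable of table-like objects, for example
    `Document(f_name).tables` from python-docx.
    """
    wanted = normalize(first_cell_string)
    matches = []
    for table in tables:
        rows = list(table.rows)
        if not rows:
            continue  # a table with no rows has no first cell
        first_row_cells = rows[0].cells
        if first_row_cells and normalize(first_row_cells[0].text) == wanted:
            matches.append(table)
    return matches


def table_to_records(table):
    """Turn a table into a list of dicts keyed by the header row."""
    grid = [[cell.text.strip() for cell in row.cells] for row in table.rows]
    if not grid:
        return []
    header, body = grid[0], grid[1:]
    return [dict(zip(header, row)) for row in body]
```

With a real file, something like `[pd.DataFrame(table_to_records(t)) for t in find_tables(Document(f_name).tables)]` should give the same list of dataframes as the function above, assuming `pd` and `Document` are imported as in the question.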