python ocr docx text-parsing

Extract text from a Word document and store it in an Excel file using Python


I have a Word document that is a standard template used for our wider meetings. The document has two columns: the first column holds the headers and the second column holds the corresponding details. I have attached a screenshot to show the structure of the document.

I would now like to extract the text from both columns using Python and store it in a DataFrame. The resulting DataFrame should look like the following:

Title     In Force?   Date          Who attended the event? 
Test      Yes         03/10/1999    X, Y

How can I achieve this?


Solution

  • Here is the parser from abdulsaboor's answer:

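    import pandas as pd

    def get_table_from_docx(document):
        """Return every table in the document as a list of DataFrames."""
        tables = []
        for table in document.tables:
            # Start from an empty grid sized like the table, then fill in cell text.
            rows = [['' for _ in range(len(table.columns))] for _ in range(len(table.rows))]
            for i, row in enumerate(table.rows):
                for j, cell in enumerate(row.cells):
                    if cell.text:
                        rows[i][j] = cell.text
            tables.append(pd.DataFrame(rows))
        return tables  # a list of DataFrames, one per table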
    

    Then:

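    from docx import Document  # provided by the python-docx package
    document = Document("meeting.docx")  # placeholder path; point this at your own file
    df = get_table_from_docx(document)[0]
    df = df.set_index(0).T  # transpose so the headers in column 0 become the column names
    # Bullet points come through as "\n"; replace them with commas.
    df["Who attended the event?"] = df["Who attended the event?"].str.replace("\n", ", ")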
    

    Out:

    0 Title In Force?    Date          Who attended the event?
    1 Test  Yes          03/10/1999    X, Y
    

    Note: If the document contains multiple tables, you can use this:

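    df_list = get_table_from_docx(document)
    frames = []
    for table_df in df_list:
        # Transpose each table so its headers become the column names.
        frames.append(table_df.set_index(0).T)
    final_df = pd.concat(frames)  # concatenate once instead of inside the loop
    # Clean the attendee column once, after all tables are combined.
    final_df["Who attended the event?"] = final_df["Who attended the event?"].str.replace("\n", ", ")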
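
The question also asks for the result to end up in an Excel file, which the answer above does not show. The following is a minimal sketch of that last step, not part of the original answer. It builds by hand the kind of two-column frame the parser would return for the template in the question, so it can be tried without a Word file. The cell values are taken from the expected output in the question, the assumption that bullets arrive as newline-separated text follows the answer, and the names `reshape` and `save_to_excel` are made up for illustration.

```python
import pandas as pd

ATTENDEES = "Who attended the event?"

# Hand-built stand-in for get_table_from_docx(document)[0]:
# column 0 holds the headers, column 1 holds the details.
raw = pd.DataFrame([
    ["Title", "Test"],
    ["In Force?", "Yes"],
    ["Date", "03/10/1999"],
    [ATTENDEES, "X\nY"],  # assumption: bullets come back newline-separated
])


def reshape(two_col):
    """Turn a header/detail table into a single-row frame keyed by header."""
    wide = two_col.set_index(0).T
    wide.columns.name = None  # drop the leftover "0" label on the columns
    wide[ATTENDEES] = wide[ATTENDEES].str.replace("\n", ", ", regex=False)
    return wide.reset_index(drop=True)


def save_to_excel(frame, path):
    """Write the frame to an .xlsx file without the row index.

    DataFrame.to_excel needs an Excel writer engine such as openpyxl
    to be installed; that is an assumption about your environment.
    """
    frame.to_excel(path, index=False)


result = reshape(raw)
print(result.to_string(index=False))
```

Calling `save_to_excel(result, "meetings.xlsx")` would then write the row to a workbook; the file name is only a placeholder. With several tables, the same call can be applied to the concatenated frame instead.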