I have a Word document that serves as a standard template for our wider meetings. The document has two columns: the first holds all the headers and the second holds the corresponding details. I have attached a screenshot to show the structure of the document.
I would now like to extract the text from both columns using Python and store it in a DataFrame. The resulting DataFrame should look like this:
Title In Force? Date Who attended the event?
Test Yes 03/10/1999 X, Y
How can I achieve this?
Here is the parser from abdulsaboor's answer:
import pandas as pd

def get_table_from_docx(document):
    """Return one DataFrame per table in a python-docx Document."""
    tables = []
    for table in document.tables:
        # Pre-fill a rows x columns grid with empty strings.
        df = [['' for i in range(len(table.columns))] for j in range(len(table.rows))]
        for i, row in enumerate(table.rows):
            for j, cell in enumerate(row.cells):
                if cell.text:
                    df[i][j] = cell.text
        tables.append(pd.DataFrame(df))
    return tables  # a list of DataFrames
Then:
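from docx import Document
document = Document("meeting_template.docx")  # hypothetical path; use your own file
df = get_table_from_docx(document)[0]
df = df.set_index(0).T  # transpose so the headers become columns
# Bulleted items are separated by "\n"; replace each with a comma.
df["Who attended the event?"] = df["Who attended the event?"].str.replace("\n", ", ")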
Out:
0 Title In Force? Date Who attended the event?
1 Test Yes 03/10/1999 X, Y
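To see what the transpose step does without needing a .docx file, here is a minimal sketch that uses a hand-built DataFrame as a stand-in for the parser's output. The `raw` and `wide` names are illustrative only, and the sample values are taken from the expected output in the question.

```python
import pandas as pd

# Hypothetical stand-in for what the parser returns for one two-column table:
# column 0 holds the headers, column 1 holds the values.
raw = pd.DataFrame([
    ["Title", "Test"],
    ["In Force?", "Yes"],
    ["Date", "03/10/1999"],
    ["Who attended the event?", "X\nY"],
])

# Make the header column the index, then transpose so headers become columns.
wide = raw.set_index(0).T

# Bulleted items arrive separated by newlines; join them with ", " instead.
# regex=False treats "\n" as a literal string rather than a pattern.
wide["Who attended the event?"] = wide["Who attended the event?"].str.replace(
    "\n", ", ", regex=False
)

print(wide.to_string(index=False))
```

Passing `regex=False` is optional here, but it makes the intent explicit, since the default for that parameter has differed between pandas versions.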
Note: if the document contains multiple tables, you can use this:
df_list = get_table_from_docx(document)
final_df = pd.DataFrame()
for i in df_list:
    i = i.set_index(0).T
    i["Who attended the event?"] = i["Who attended the event?"].str.replace("\n", ", ")  # you can do this outside the loop.
    final_df = pd.concat([final_df, i])
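As a possible variation on the multi-table case, the transposed frames can be collected in a list and concatenated once, with the newline replacement applied to the combined result. This is only a sketch: the two hand-built frames below are hypothetical stand-ins for the parser's output, and the second table's values are made up for illustration.

```python
import pandas as pd

# Hypothetical stand-ins for the DataFrames the parser returns, one per table.
df_list = [
    pd.DataFrame([["Title", "Test"], ["In Force?", "Yes"],
                  ["Date", "03/10/1999"], ["Who attended the event?", "X\nY"]]),
    pd.DataFrame([["Title", "Demo"], ["In Force?", "No"],
                  ["Date", "04/11/2000"], ["Who attended the event?", "Z"]]),
]

# Transpose every table first, then concatenate a single time.
frames = [t.set_index(0).T for t in df_list]
final_df = pd.concat(frames, ignore_index=True)
final_df.columns.name = None  # drop the leftover "0" label from set_index(0)

# Apply the newline replacement once, on the combined frame.
col = "Who attended the event?"
final_df[col] = final_df[col].str.replace("\n", ", ", regex=False)

print(final_df.to_string(index=False))
```

With `ignore_index=True` the rows are renumbered from 0, so each table ends up as one uniquely indexed row.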