I'm new to posting here.
I'm currently trying to extract tables from a Word document and lay them out in a transposed data frame that can be exported as a CSV.
My issue is with the data frame I get from the following code:
from docx.api import Document
import pandas as pd

def extract_tables_from_docx(path, output_path, name):
    document = Document(path)
    data = []
    for table in document.tables:
        keys = tuple(cell.text for cell in table.rows[0].cells)
        for row in table.rows[1:]:
            data.append(dict(zip(keys, (cell.text for cell in row.cells))))
    df1 = pd.DataFrame(data).T
    print(df1)
This is the data frame I currently get when I call the function with the relevant arguments.
The issue is that each new data set adds extra columns, when I want the data to fill in where the NaNs are. Basically, every new entry from the loop is placed further to the right, if that's the way to describe it. I'm fairly new to Python, so apologies if this code doesn't look good.
Can anyone help me get around this? Any help is appreciated.
Edit:
Your data is organized "vertically" with the records in columns rather than rows. So you need something like this:
from docx.api import Document
import pandas as pd

def extract_tables_from_docx(path):
    document = Document(path)
    data = []
    for table in document.tables:
        keys = (cell.text for cell in table.columns[0].cells)
        values = (cell.text for cell in table.columns[1].cells)
        data.append(dict(zip(keys, values)))
    df1 = pd.DataFrame(data).T
    print(df1)
Give that a try and see what you get.
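If it helps to see why the row-based loop produces NaNs, here is a small sketch that uses plain lists instead of a real Word document, so it runs without python-docx. The sample `tables` data and the function names `records_by_row` and `records_by_column` are made up for illustration, and I'm assuming each table has the field names in its first column and the values in its second.

```python
import pandas as pd

# Hypothetical stand-in for two Word tables: each inner list is one table row,
# with the field name in the first cell and its value in the second.
tables = [
    [["Name", "Alice"], ["Age", "30"], ["City", "Leeds"]],
    [["Name", "Bob"], ["Age", "25"], ["City", "York"]],
]

def records_by_row(tables):
    """Mimic the question's loop: first row as keys, remaining rows as records."""
    data = []
    for rows in tables:
        keys = tuple(rows[0])
        for row in rows[1:]:
            data.append(dict(zip(keys, row)))
    return pd.DataFrame(data)

def records_by_column(tables):
    """Mimic the loop above: first column as keys, second column as values."""
    data = []
    for rows in tables:
        keys = (row[0] for row in rows)
        values = (row[1] for row in rows)
        data.append(dict(zip(keys, values)))
    return pd.DataFrame(data)

# Each table contributes its own key ("Alice", "Bob"), so pandas creates a
# separate column per table and fills the gaps with NaN.
wrong = records_by_row(tables)

# Every table shares the same keys, so each table becomes one complete record.
right = records_by_column(tables)

# With no path given, to_csv returns the CSV text instead of writing a file.
csv_text = right.to_csv(index=False)

print(wrong)
print(right)
```

Here `right` has one row per table. Add `.T` as in the function above if you want the fields as rows, and pass a file path to `to_csv` to write the CSV to disk.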