Tags: python, python-3.x, pandas, dataframe, python-docx

How to create a DataFrame without adding new columns and rows for each dataset in a for loop


I'm new to posting here.

I'm trying to extract tables from a Word document and lay them out in a transposed DataFrame that can be exported as a CSV.

My issue is with the DataFrame I get from the following code:

from docx.api import Document
import pandas as pd

def extract_tables_from_docx(path,output_path,name):
    document = Document(path)
    data = []
    for table in document.tables:
        keys = tuple(cell.text for cell in table.rows[0].cells)
        for row in table.rows[1:]:
            data.append(dict(zip(keys,(cell.text for cell in row.cells))))
    
    df1 = pd.DataFrame(data).T
    print(df1)

This is the DataFrame I currently get when I call the function with the relevant inputs.

The problem is that each new dataset adds extra columns, when I want its values to fill the cells that currently hold NaN. Basically, every new entry from the loop pushes the data further to the right. I'm fairly new to Python, so apologies if this code doesn't look good.

Can anyone help me get around this? Any help is appreciated.

Edit:

This is how I expect my data frames to appear

The dataset I'm using


Solution

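  • Your data is organized "vertically", with the records in columns rather than rows: the code below takes the field names from the first column of each table and the values from the second. So you need something like this: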

    from docx.api import Document
    import pandas as pd
    
    
    def extract_tables_from_docx(path):
        document = Document(path)
        data = []
    
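        for table in document.tables:
            # The first column holds the field names and the second the values,
            # so each table becomes a single record (one dict).
            keys = (cell.text for cell in table.columns[0].cells)
            values = (cell.text for cell in table.columns[1].cells)
            data.append(dict(zip(keys, values)))

        # pd.DataFrame(data) has one row per table; .T turns each table into a column
        df1 = pd.DataFrame(data).T
        print(df1)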
    

    Give that a try and see what you get.
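
    To see why the original loop scatters the values, here is a minimal sketch in plain Python that needs neither python-docx nor pandas. The tables and their contents (Name, Age, City) are made up for illustration, and the two helper functions are hypothetical stand-ins for the question's loop and the loop above. Reading by row treats the first row of each table as the header, so every table contributes different keys. Reading by column gives every record the same keys.

```python
# Hypothetical stand-ins for two Word tables, each stored as a list of rows.
# Field names sit in the first column and values in the second.
tables = [
    [["Name", "Alice"], ["Age", "30"], ["City", "Leeds"]],
    [["Name", "Bob"], ["Age", "25"], ["City", "York"]],
]


def records_by_row(tables):
    """Mimic the question's loop: the first row is treated as the header."""
    data = []
    for rows in tables:
        keys = tuple(rows[0])
        for row in rows[1:]:
            data.append(dict(zip(keys, row)))
    return data


def records_by_column(tables):
    """Mimic the answer's loop: first column holds keys, second holds values."""
    data = []
    for rows in tables:
        keys = (row[0] for row in rows)
        values = (row[1] for row in rows)
        data.append(dict(zip(keys, values)))
    return data


by_row = records_by_row(tables)
by_column = records_by_column(tables)

# Keys differ from table to table ("Alice" vs "Bob"), which is what
# produces the extra columns and NaN cells once the list becomes a DataFrame.
print(by_row)

# Every record shares the keys Name, Age and City, so the values line up.
print(by_column)
```

    Passing a list like by_column to pd.DataFrame gives one row per table with no gaps, which is what the function above builds from the real document.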