Search code examples
pythonxmlparsingdocxpython-docx

Parsing of table from .docx file


I want to parse a table from a .docx file using Python and python-docx into some useful data structure.

The .docx file contains only a single table in my case. I've uploaded it so you can have a look. Here's a screenshot:

Books.docx


Solution

  • You can use the snippet below to parse your document into a list where each row is a dictionary mapping the table header value to the column value.

    from docx.api import Document
    
    # Load the first table from your document. In your example file,
    # there is only one table, so I just grab the first one.
    document = Document('Books.docx')
    table = document.tables[0]
    
    # Data will be a list of rows represented as dictionaries
    # containing each row's data.
    data = []
    
    keys = None
    for i, row in enumerate(table.rows):
        text = (cell.text for cell in row.cells)
    
        # Establish the mapping based on the first row
        # headers; these will become the keys of our dictionary
        if i == 0:
            keys = tuple(text)
            continue
    
        # Construct a dictionary for this row, mapping
        # keys to values for this row
        row_data = dict(zip(keys, text))
        data.append(row_data)
    

    This will give you:

    data = [
      {u'Pub.': u'Penguin Books',
       u'Auther': u'Edward de BONO',
       u'Sr. No.': u'1',
       u'Name of Book': u'Six Thinking Hats'
      },
      ...
    ]
    

    If you'd just want a tuple for each row, you should instead of creating a dictionary just set row_data to the tuple value of text, so in the loop instead of constructing the dict, do:

    # Construct a tuple for this row
    row_data = tuple(text)
    data.append(row_data)
    

    Now, data would hold something like this instead:

    data = [
      (u'1',
       u'Six Thinking Hats',
       u'Edward de BONO',
       u'Penguin Books'
      ),
     ...
    ]
    

    Then you can skip constructing keys, obviously (but still skip the first row!).