Tags: python, openpyxl, python-docx

Generating an organized Excel spreadsheet from a Word document


I have a Microsoft Word document that we want to transfer to Excel. Every sentence needs to be separated and then pasted into the next appropriate cell in Excel. Each sentence also needs to be classified as a heading, a requirement, or informational. Here is what the typical Word format looks like:

2.3.4     Lightening Transient Response 
          The device shall meet spec 24532. Voltage must resemble figure.
          Figure 1.

which translates to

<numbering>      <Heading>
                 <Requirements/information>

In Excel, that is almost exactly how I would like the document to look, except that the second requirement sentence should be in the row just below the previous requirement sentence.

2.3.4   | Lightening Transient Response     | Heading
        | The device shall meet spec 24532. | Requirement
        | Voltage must resemble figure.     | Requirement
        | Figure 1.                         | Informational
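
To make the target concrete, here is a minimal sketch of the splitting and labelling described above. It is not the code from the project: `split_sentences`, `classify` and `to_rows` are hypothetical names, and it assumes that "shall" or "must" marks a requirement and that a sentence ends at ".", "!" or "?" followed by whitespace.

```python
import re

# Assumption for illustration: these words mark a requirement.
REQUIREMENT_WORDS = re.compile(r"\b(shall|must)\b", re.IGNORECASE)


def split_sentences(text):
    """Naively split on whitespace that follows '.', '!' or '?'."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


def classify(sentence):
    """Label a body sentence as a requirement or as informational."""
    if REQUIREMENT_WORDS.search(sentence):
        return "Requirement"
    return "Informational"


def to_rows(number, heading, body_paragraphs):
    """Build (number, text, label) rows: the heading first, then one row per sentence."""
    rows = [(number, heading, "Heading")]
    for paragraph in body_paragraphs:
        for sentence in split_sentences(paragraph):
            rows.append(("", sentence, classify(sentence)))
    return rows


rows = to_rows(
    "2.3.4",
    "Lightening Transient Response",
    ["The device shall meet spec 24532. Voltage must resemble figure.", "Figure 1."],
)
for row in rows:
    print(row)
```

Each tuple could then be written to a sheet with openpyxl's `Worksheet.append`. The splitter is deliberately naive and would break on abbreviations such as "e.g." or on decimal numbers followed by a space.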

I have attempted this project in Python using the openpyxl and python-docx modules. I have code that can go into the Word document and get sentences, and code that can analyze a sentence. I'm retrieving runs from paragraphs. I am having problems because not all sentences are coming back, due to how the Word document is formatted. I am typically only getting the headings back. The heading numbers are not stored in runs. The requirements underneath the headings are stored in tables. I have written some code to get into the tables and extract the text from cells, so that is one way to get the requirements; however, that snippet is giving problems (it gives me the same sentence three times in a row).

I'm looking for other possible ways to do this. I'm thinking of a format switch. XML has been mentioned, and converting to PDF and using one of Python's PDF modules may also be possible.

Any thoughts or advice would be greatly appreciated.

-Chris


Solution

  • XML is going to be harder, not easier. You're closer than you seem to think. I recommend attacking each problem separately until you crack it.

    The same-sentence-three-times problem in the table is caused by merged cells. The way python-docx works on tables, there is an underlying table layout of x rows and y columns. If two side-by-side cells are merged, you get the same result for both of those cells. You can detect this by comparing the two cells for equality. Roughly: "if this_cell == last_cell, skip this cell".

    There's no way around the heading problem. Heading numbers only exist inside a running instance of Word; they are generated at display (or print) time. To get those you need to use the same rules to generate your own numbers. So you'd need to keep track of the number of headings you've passed through etc. and form your own dot-separated numbering.
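
Both fixes can be sketched roughly as follows. This is an illustration under stated assumptions, not a definitive implementation: `unique_cell_texts`, `heading_level` and `HeadingNumberer` are hypothetical names, and the demo uses stand-in objects so it runs without a document. The sketch detects a merged cell by comparing the private `_tc` attribute (the underlying XML element of a python-docx cell), which may be more dependable than `==` on the cell objects themselves. It also assumes that headings use the built-in "Heading N" styles and that numbering starts at 1 at every level.

```python
import re
from types import SimpleNamespace


def unique_cell_texts(cells):
    """Return cell texts, skipping a cell that repeats the previous one.

    Assumes each cell has `text` and `_tc`; side-by-side merged cells
    share the same `_tc` element.
    """
    texts = []
    last_tc = None
    for cell in cells:
        if cell._tc is last_tc:
            continue  # same merged cell reported again
        texts.append(cell.text)
        last_tc = cell._tc
    return texts


def heading_level(style_name):
    """Return N for a style named 'Heading N', otherwise None."""
    match = re.fullmatch(r"Heading (\d+)", style_name)
    return int(match.group(1)) if match else None


class HeadingNumberer:
    """Rebuild dot-separated heading numbers by counting headings passed."""

    def __init__(self):
        self.counters = []

    def next_number(self, level):
        # Pad with zeros if a level is skipped, and drop deeper counters
        # when moving back up the outline.
        while len(self.counters) < level:
            self.counters.append(0)
        del self.counters[level:]
        self.counters[level - 1] += 1
        return ".".join(str(n) for n in self.counters)


# Stand-ins for python-docx cells: the first two share one element (merged).
tc_a, tc_b = object(), object()
row = [
    SimpleNamespace(_tc=tc_a, text="The device shall meet spec 24532."),
    SimpleNamespace(_tc=tc_a, text="The device shall meet spec 24532."),
    SimpleNamespace(_tc=tc_b, text="Figure 1."),
]
deduped = unique_cell_texts(row)

numberer = HeadingNumberer()
styles = ["Heading 1", "Heading 2", "Heading 2", "Heading 3", "Heading 1"]
numbers = [numberer.next_number(heading_level(s)) for s in styles]
print(deduped, numbers)
```

With a real document, `row.cells` and `paragraph.style.name` would take the place of the stand-ins. The sketch only skips repeats that are next to each other in a row, so vertically merged cells would need a set of already-seen elements. A document whose numbering starts elsewhere or uses a custom list definition would need different counting rules.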