ruby, xml, database-design, docx

How would you parse/store/modify/save docx files?


I'm working on an application that has to deal with docx files. I know that a docx file is just a zip archive containing XML files, images, and other files.

My application would have to:

  1. Import docx files and store their representation (the text, but also everything related to presentation, such as styles and fonts) in a database.

  2. Provide a way to modify the text of each sentence on a web page.

  3. Export the docx file with the new text while preserving the style/presentation.

The tricky part is that I have to support nested tags. For instance, a tag that contains a sentence can also contain tags that make a single word bold.

I do not have any requirements on the database. It can be anything.

My question is more about how to represent the data and how to meet these requirements than about how to parse XML.

Thanks!


Solution

  • The question is not an easy one.

    Here is a related question I answered: Creating RTF , DOC , or DOCX in iOS

    After you have read that, here is a real-world example:

    <w:p w:rsidP="00CA7135" w:rsidR="00137C91" w:rsidRDefault="00137C91">
        <w:r>
            <w:t>Hello</w:t>
        </w:r>
        <w:r w:rsidR="008C194D">
            <w:t xml:space="preserve"/>
        </w:r>
        <w:r>
            <w:t>My name</w:t>
        </w:r>
    </w:p>
    <w:p w:rsidP="00CA7135" w:rsidR="008C194D" w:rsidRDefault="00137C91">
        <w:r>
            <w:t xml:space="preserve">is</w:t>
        </w:r>
        <w:r w:rsidR="008C194D" w:rsidRPr="00E92392">
            <w:rPr>
                <w:b/>
            </w:rPr>
            <w:t xml:space="preserve">John Doe</w:t>
        </w:r>
        <w:proofErr w:type="spellStart"/>
        <w:r w:rsidR="008C194D" w:rsidRPr="00E92392">
            <w:rPr>
                <w:b/>
            </w:rPr>
            <w:t/>
        </w:r>
        <w:proofErr w:type="spellEnd"/>
        <w:r w:rsidR="008C194D" w:rsidRPr="00E92392">
            <w:rPr>
                <w:b/>
            </w:rPr>
            <w:t xml:space="preserve"/>
        </w:r>
        <w:r w:rsidR="008C194D">
            <w:t xml:space="preserve"/>
        </w:r>
        <w:r>
            <w:t>I want to</w:t>
        </w:r>
        <w:r w:rsidR="008C194D">
            <w:t xml:space="preserve"/>
        </w:r>
        <w:r>
            <w:t>show</w:t>
        </w:r>
        <w:r w:rsidR="00E92392">
            <w:t xml:space="preserve">how difficult it is</w:t>
        </w:r>
    </w:p>
    

    As you can see, the text of a single paragraph is rarely stored in one piece: it is split across several <w:r> runs, each with its own <w:t>.
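
To make the fragmentation concrete, here is a minimal sketch in Python (standard library only; the same idea should carry over to Ruby) that collects the `<w:t>` fragments of each `<w:p>`. The `SAMPLE` string is a shortened stand-in modelled on the XML above, and the single space inside the `xml:space="preserve"` run is an assumption, since the whitespace in the listing appears to have been lost. The function name `paragraph_runs` is hypothetical.

```python
import xml.etree.ElementTree as ET

# WordprocessingML main namespace, bound to the "w" prefix in document.xml.
W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
NS = {"w": W}

# Shortened, hypothetical sample modelled on the listing above.
SAMPLE = f"""<w:body xmlns:w="{W}">
  <w:p>
    <w:r><w:t>Hello</w:t></w:r>
    <w:r><w:t xml:space="preserve"> </w:t></w:r>
    <w:r><w:t>My name</w:t></w:r>
  </w:p>
  <w:p>
    <w:r><w:t xml:space="preserve">is </w:t></w:r>
    <w:r><w:rPr><w:b/></w:rPr><w:t>John Doe</w:t></w:r>
  </w:p>
</w:body>"""


def paragraph_runs(root):
    """Return one list per <w:p>, holding the text of each <w:t> inside it."""
    return [
        [t.text or "" for t in p.iterfind(".//w:t", NS)]
        for p in root.iter(f"{{{W}}}p")
    ]


runs = paragraph_runs(ET.fromstring(SAMPLE))
texts = ["".join(fragments) for fragments in runs]
print(runs)   # fragments, grouped by paragraph
print(texts)  # what a user would see and edit
```

Joining the fragments gives the sentence to show on the web page, while the fragment list is what has to be kept in order to write an edit back into the right run.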

    Answer to your questions:

    1. I think the only way to store a docx in a database is to store the full XML files and images (or the whole docx as a byte array).
    2. To modify the text of a paragraph, you could search for all <w:t> tags and group them by their enclosing <w:p> tag. For example, 'Hello' and 'My name' are in the same <w:p>. You would then need a way to determine where in the paragraph the text was changed, and apply the change to the right <w:t>.
    3. This is just a matter of zipping the XML files and images back together.
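
The three steps can be sketched end to end as follows, in Python with the standard library only. This is one possible approach under stated assumptions, not a definitive implementation: the docx is treated as a byte string (as it might come out of a BLOB column), the edit position is located by comparing the common prefix and suffix of the old and new paragraph text, and all other archive entries are copied unchanged. The helper names (`apply_edit`, `edit_docx`, `run_texts`, `make_sample`) are hypothetical, and `make_sample` builds a toy archive that is not a valid Word file. Note also that ElementTree may rename namespace prefixes it does not know when serializing, which real Word documents can be sensitive to, so a production version would probably need a parser that preserves prefixes.

```python
import io
import zipfile
import xml.etree.ElementTree as ET

W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
XML_SPACE = "{http://www.w3.org/XML/1998/namespace}space"
DOCUMENT = "word/document.xml"
ET.register_namespace("w", W)  # keep the w: prefix when serializing


def clamp(value, size):
    return max(0, min(size, value))


def apply_edit(fragments, new_text):
    """Rewrite run fragments so they join to new_text, changing only edited runs.

    Returns [] when there are no fragments (nothing to write into).
    """
    old_text = "".join(fragments)
    start = 0  # length of the common prefix
    while (start < min(len(old_text), len(new_text))
           and old_text[start] == new_text[start]):
        start += 1
    end_old, end_new = len(old_text), len(new_text)  # trim the common suffix
    while (end_old > start and end_new > start
           and old_text[end_old - 1] == new_text[end_new - 1]):
        end_old -= 1
        end_new -= 1
    replacement = new_text[start:end_new]
    result, pos, inserted = [], 0, False
    for frag in fragments:
        head = frag[:clamp(start - pos, len(frag))]    # text before the edit
        tail = frag[clamp(end_old - pos, len(frag)):]  # text after the edit
        pos += len(frag)
        if not inserted and start <= pos:
            # First run that reaches the edit position receives the new text.
            result.append(head + replacement + tail)
            inserted = True
        else:
            result.append(head + tail)
    return result


def text_nodes(root, paragraph_index):
    paragraph = list(root.iter(f"{{{W}}}p"))[paragraph_index]
    return list(paragraph.iter(f"{{{W}}}t"))


def edit_docx(data, paragraph_index, new_text):
    """Return new docx bytes with one paragraph's text replaced."""
    with zipfile.ZipFile(io.BytesIO(data)) as src:
        entries = [(info.filename, src.read(info.filename))
                   for info in src.infolist()]
    out = io.BytesIO()
    with zipfile.ZipFile(out, "w", zipfile.ZIP_DEFLATED) as dst:
        for name, payload in entries:
            if name == DOCUMENT:
                root = ET.fromstring(payload)
                nodes = text_nodes(root, paragraph_index)
                old = [node.text or "" for node in nodes]
                for node, text in zip(nodes, apply_edit(old, new_text)):
                    node.text = text
                    node.set(XML_SPACE, "preserve")  # keep edge whitespace
                payload = ET.tostring(root, encoding="utf-8",
                                      xml_declaration=True)
            dst.writestr(name, payload)  # everything else is copied as-is
    return out.getvalue()


def run_texts(data, paragraph_index):
    with zipfile.ZipFile(io.BytesIO(data)) as archive:
        root = ET.fromstring(archive.read(DOCUMENT))
    return [node.text or "" for node in text_nodes(root, paragraph_index)]


def make_sample():
    """Build a toy archive with the same layout as a docx (not a valid one)."""
    document = (f'<w:document xmlns:w="{W}"><w:body><w:p>'
                '<w:r><w:t xml:space="preserve">is </w:t></w:r>'
                '<w:r><w:rPr><w:b/></w:rPr><w:t>John Doe</w:t></w:r>'
                '</w:p></w:body></w:document>')
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as archive:
        archive.writestr(DOCUMENT, document)
        archive.writestr("word/media/image1.png", b"placeholder, not an image")
    return buf.getvalue()


original = make_sample()
edited = edit_docx(original, 0, "is Jane Doe")
print(run_texts(original, 0))
print(run_texts(edited, 0))
```

Because only the run that contains the change is rewritten, the bold `<w:rPr>` on the second run stays attached to the edited name. Text inserted exactly on a boundary between two runs goes to the earlier run in this sketch; that is a design choice, and an application might prefer a different rule.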