python xml docx

Clear new lines in docx


I have a docx file that contains a lot of new lines between sections. I need to remove a new line whenever more than one appears consecutively. I unzip the file using:

z = zipfile.ZipFile('File.docx','a')
z.extractall()

Inside the extracted directory word there is a file, document.xml, that contains all the data, but I don't know how to tell where a new line is in the XML.

I know that extracting it is not the solution (I use it here only to show where the file is). I think I can use:

z.write('Document.xml')

Can anyone help me?


Solution

  • The code from tlewis is for finding a particular text in the docx and replacing it. Your case needs something else: detect the new lines and check whether two or more appear in a row. In Word, a new line is just a paragraph (a <w:p> tag) without any text inside.

    I have added comments that show how to work with the zip archive.

    import zipfile  # Read and write the .docx, which is a zip archive
    from lxml import etree  # Parse the XML and serialize it back
    templateDocx = zipfile.ZipFile("C:/Template.docx")  # Path to the file you want to import
    newDocx = zipfile.ZipFile("C:/NewDocument.docx", "w")  # Name of the output file ("w" starts a fresh archive)
    
    # Extract word/document.xml, the file that contains the content, and read it as bytes
    with open(templateDocx.extract("word/document.xml", "C:/"), "rb") as tempXmlFile:
        tempXmlStr = tempXmlFile.read()
    
    
    tempXmlXml = etree.fromstring(tempXmlStr)  # Parse the bytes into an XML tree
    ############
    # Algorithm detailed at the bottom.
    # Write the code here that selects all <w:p> tags and checks whether each one contains a <w:t> tag.
    ############
    
    # Serialize the changed XML back to bytes, keeping the XML declaration
    tempXmlStr = etree.tostring(tempXmlXml, pretty_print=True, xml_declaration=True, encoding="UTF-8", standalone=True)
    
    with open("C:/temp.xml", "wb") as tempXmlFile:
        tempXmlFile.write(tempXmlStr)  # Write the changed file
    
    for file in templateDocx.filelist:
        if not file.filename == "word/document.xml":
            newDocx.writestr(file.filename, templateDocx.read(file))  # Copy every file except the changed one into the new archive
    
    newDocx.write("C:/temp.xml", "word/document.xml")  # Add the changed document.xml
    
    templateDocx.close()  # Close both the template and the new docx
    newDocx.close()
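
The question asks whether the archive can be changed without extracting it to disk. A minimal sketch of one way to do that with the standard zipfile module follows: it copies every member into a new archive and passes a single member through a function. The helper name rewrite_member and its transform parameter are hypothetical, and the demo builds a tiny in-memory archive as a stand-in, not a valid Word file.

```python
import io
import zipfile


def rewrite_member(src, dst, member, transform):
    """Copy a zip archive from src to dst, passing one member through transform.

    src and dst may be paths or file-like objects; transform maps bytes to bytes.
    """
    with zipfile.ZipFile(src) as zin, \
            zipfile.ZipFile(dst, "w", zipfile.ZIP_DEFLATED) as zout:
        for info in zin.infolist():
            data = zin.read(info.filename)
            if info.filename == member:
                data = transform(data)
            # Passing the ZipInfo keeps the member's name, date and compression type
            zout.writestr(info, data)


# Demo with an in-memory stand-in for a .docx (not a valid Word file)
source = io.BytesIO()
with zipfile.ZipFile(source, "w") as z:
    z.writestr("[Content_Types].xml", "<Types/>")
    z.writestr("word/document.xml", "<doc>old</doc>")
source.seek(0)

target = io.BytesIO()
rewrite_member(source, target, "word/document.xml",
               lambda data: data.replace(b"old", b"new"))

with zipfile.ZipFile(target) as z:
    names = z.namelist()
    content = z.read("word/document.xml")
```

With real files, src and dst would be two different paths such as the template and the new document, and transform would parse the bytes, edit the tree and serialize it again.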
    

    How to write the algorithm to remove the multiple new lines

    Here is a sample document I created (screenshot: "Many Lines Docx").

    Here is the corresponding code of document.xml:

          <w:p w:rsidR="006C517B" w:rsidRDefault="00761A87">
             <w:bookmarkStart w:id="0" w:name="_GoBack" />
             <w:bookmarkEnd w:id="0" />
             <w:r>
                <w:t>First Line</w:t>
             </w:r>
          </w:p>
          <w:p w:rsidR="00761A87" w:rsidRDefault="00761A87" />
          <w:p w:rsidR="00761A87" w:rsidRDefault="00761A87">
             <w:proofErr w:type="spellStart" />
             <w:r>
                <w:t>Third</w:t>
             </w:r>
             <w:proofErr w:type="spellEnd" />
             <w:r>
                <w:t xml:space="preserve"> Line</w:t>
             </w:r>
          </w:p>
          <w:p w:rsidR="00761A87" w:rsidRDefault="00761A87" />
          <w:p w:rsidR="00761A87" w:rsidRDefault="00761A87" />
          <w:p w:rsidR="00761A87" w:rsidRDefault="00761A87">
             <w:r>
                <w:t>Six Line</w:t>
             </w:r>
          </w:p>
          <w:p w:rsidR="00761A87" w:rsidRDefault="00761A87" />
          <w:p w:rsidR="00761A87" w:rsidRDefault="00761A87" />
          <w:p w:rsidR="00761A87" w:rsidRDefault="00761A87" />
          <w:p w:rsidR="00761A87" w:rsidRDefault="00761A87">
             <w:proofErr w:type="spellStart" />
             <w:r>
                <w:t>Ten</w:t>
             </w:r>
             <w:proofErr w:type="spellEnd" />
             <w:r>
                <w:t xml:space="preserve"> Line</w:t>
             </w:r>
          </w:p>
          <w:p w:rsidR="00761A87" w:rsidRDefault="00761A87">
             <w:proofErr w:type="spellStart" />
             <w:r>
                <w:t>Eleven</w:t>
             </w:r>
             <w:proofErr w:type="spellEnd" />
             <w:r>
                <w:t xml:space="preserve"> Line</w:t>
             </w:r>
          </w:p>
    

    As you can see, a new line is an empty <w:p>, like this one:

    <w:p w:rsidR="00761A87" w:rsidRDefault="00761A87" />
    

    To remove the repeated new lines, check whether several empty <w:p> elements appear in a row, and remove all but the first.
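
The rule above can be sketched roughly as follows. This is one possible reading, under two assumptions that are not in the original answer: a paragraph counts as empty when none of its <w:t> descendants holds text (so a paragraph containing only an image or a break would also count as empty), and only paragraphs that are direct children of the given parent are examined. The function names are hypothetical. The demo uses the standard library's xml.etree.ElementTree so it runs without extra packages; the functions only use calls that lxml elements also provide.

```python
import xml.etree.ElementTree as ET

# Namespace bound to the "w" prefix in word/document.xml
W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
W_P = "{%s}p" % W_NS
W_T = "{%s}t" % W_NS


def is_empty_paragraph(elem):
    """True for a <w:p> that holds no text in any <w:t> descendant."""
    if elem.tag != W_P:
        return False
    return not any(t.text for t in elem.iter(W_T))


def collapse_empty_paragraphs(parent):
    """Remove every empty <w:p> that directly follows another empty <w:p>.

    Returns the number of paragraphs removed.
    """
    removed = 0
    previous_was_empty = False
    for child in list(parent):  # iterate over a copy, because parent is mutated
        if is_empty_paragraph(child):
            if previous_was_empty:
                parent.remove(child)
                removed += 1
            previous_was_empty = True
        else:
            previous_was_empty = False
    return removed


# Shortened version of the sample document shown earlier
SAMPLE = (
    '<w:body xmlns:w="%s">'
    '<w:p><w:r><w:t>First Line</w:t></w:r></w:p>'
    '<w:p/>'
    '<w:p><w:r><w:t>Third Line</w:t></w:r></w:p>'
    '<w:p/><w:p/>'
    '<w:p><w:r><w:t>Six Line</w:t></w:r></w:p>'
    '<w:p/><w:p/><w:p/>'
    '<w:p><w:r><w:t>Ten Line</w:t></w:r></w:p>'
    '</w:body>' % W_NS
)

body = ET.fromstring(SAMPLE)
removed_count = collapse_empty_paragraphs(body)
remaining = ["" if is_empty_paragraph(p) else "".join(p.itertext()) for p in body]
print(removed_count, remaining)
```

In a real document.xml the paragraphs sit inside the <w:body> child of the root <w:document> element, so the function would be called on that body element. Paragraphs nested in table cells are left untouched by this sketch.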

    Hope that helps!