Search code examples
pythonfile-conversionpython-docx

DOCX file to text file conversion using Python


I wrote the following code to convert my docx file to text file. The output that I get printed in my text file is the last paragraph/part of the whole file and not the complete content. The code is as follows:

from docx import Document
import io
import shutil

def convertDocxToText(path):
    for d in os.listdir(path):
        fileExtension=d.split(".")[-1]
        if fileExtension =="docx":
            docxFilename = path + d
            print(docxFilename)
            document = Document(docxFilename)


# for printing the complete document
            print('\nThe whole content of the document:->>>\n')
            for para in document.paragraphs:
                textFilename = path + d.split(".")[0] + ".txt"
                with io.open(textFilename,"w", encoding="utf-8") as textFile:
                    #textFile.write(unicode(para.text))
                    x=unicode(para.text)
                    print(x) //the complete content gets printed by this line
                    textFile.write((x)) #after writing the content to text file only last paragraph is copied.
                #textFile.write(para.text)

path= "/home/python/resumes/"
convertDocxToText(path)

Solution

  • Problem

    as your code says in the last for loop:

            for para in document.paragraphs:
                textFilename = path + d.split(".")[0] + ".txt"
                with io.open(textFilename,"w", encoding="utf-8") as textFile:
                    x=unicode(para.text)
                    textFile.write((x))
    

    for each paragraph in whole document, you try to open a file named textFilename so let's say you have a file named MyFile.docx in /home/python/resumes/ so the textFilename value that contains the path will be /home/python/resumes/MyFile.txt always in whole of for loop, so the problem is that you open the same file in w mode which is a Write mode, and will overwrite the whole file content.

    Solution:

    you must open the file once out of that for loop then try add paragraphs one by one to it.