Tags: python, regex, docx, text-extraction, python-docx

How do you write text extracted from a PDF (using textract) to docx files in Python?


I have several articles in a single PDF file, and I am trying to split them apart and write each one to its own docx file. I managed to separate them with a regex, but when I try to write them to docx files, I get this error: ValueError: All strings must be XML compatible: Unicode or ASCII, no NULL bytes or control characters.

My code is as follows:

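import os
import re
import time

import docx
import textract

my_path = "/path/to/pdf"

# textract returns bytes, so decode before applying the regex
newpath = textract.process(my_path)
newpath2 = newpath.decode("UTF-8")

# capture the text between each "<n> words" header and "Document <id>" footer
result = re.findall(r'\d+ words(.*?)Document \w+', newpath2, re.DOTALL)

save_path = "/path/to/write/docx/files/"

for each in result:
    # use the current timestamp as the file name
    timestamp = str(time.time())
    finalpath = os.path.join(save_path, timestamp)
    finalpath2 = finalpath + ".docx"
    mydoc = docx.Document()
    mydoc.add_paragraph(each)  # this call raises the ValueError
    mydoc.save(finalpath2)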
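
To see which characters trigger the error, it can help to list the control characters in an extracted chunk before writing it. This is a minimal sketch, assuming the text is already a decoded `str`. The helper name `find_control_characters` and the sample string are made up for illustration. PDF extractors often emit a form feed (`\x0c`) at page breaks, which is one likely culprit.

```python
import unicodedata

def find_control_characters(text):
    """Return (index, repr) pairs for every character in a Unicode 'C' category."""
    return [
        (i, repr(ch))
        for i, ch in enumerate(text)
        if unicodedata.category(ch).startswith("C")
    ]

# Hypothetical sample text containing a form feed and a NULL byte
sample = "First page\x0cSecond page\x00"
for index, char in find_control_characters(sample):
    print(index, char)
```

Tabs and newlines are also reported, because they belong to category `Cc`, even though they are not what the ValueError complains about.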

Solution

  • Remove the NULL bytes and control characters from each string before passing it to add_paragraph:

    .add_paragraph(remove_control_characters(each.replace('\x00','')))
    

    The remove_control_characters function can be borrowed from the thread "Removing control characters from a string in python".

    Code snippet:

    import os
    import re
    import time
    import unicodedata

    import docx
    import textract

    def remove_control_characters(s):
        # drop every character whose Unicode category starts with "C"
        return "".join(ch for ch in s if unicodedata.category(ch)[0]!="C")
    
    my_path = "/path/to/pdf"
    
    newpath = textract.process(my_path)
    newpath2 = newpath.decode("UTF-8")
    
    result = re.findall(r'\d+ words(.*?)Document \w+', newpath2, re.DOTALL)
    
    save_path = "/path/to/write/docx/files/"
    
    for each in result:
        # separate name for the timestamp so the time module is not shadowed
        timestamp = str(time.time())
        finalpath = os.path.join(save_path, timestamp)
        finalpath2 = finalpath + ".docx"
        mydoc = docx.Document()
        mydoc.add_paragraph(remove_control_characters(each.replace('\x00','')))
        mydoc.save(finalpath2)
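
One side effect of filtering on Unicode category `C` is that tabs and line breaks are removed too, since they are category `Cc`. If the line breaks inside an article matter, an alternative is to strip only the characters that XML 1.0 disallows. The following is a minimal sketch and not part of the original answer. `strip_xml_incompatible` is a hypothetical helper name, and the character ranges are taken from the XML 1.0 `Char` production.

```python
import re

# XML 1.0 allows tab, LF, CR, and the ranges listed below.
# Anything outside them (NULL, form feed, other C0 controls, ...) is matched.
_XML_ILLEGAL = re.compile(
    r"[^\x09\x0A\x0D\x20-\uD7FF\uE000-\uFFFD\U00010000-\U0010FFFF]"
)

def strip_xml_incompatible(text):
    """Drop characters that XML 1.0 does not allow, keeping tabs and newlines."""
    return _XML_ILLEGAL.sub("", text)

# Hypothetical sample: the form feed and NULL byte go, the newline stays
cleaned = strip_xml_incompatible("Line one\nLine two\x0c\x00")
print(repr(cleaned))
```

In the loop above, `mydoc.add_paragraph(strip_xml_incompatible(each))` would then take the place of the `remove_control_characters` call.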