python regex docx text-extraction python-docx

How do you write text extracted from a PDF (using textract) to docx files in Python?

I have several articles in a single PDF file, and I am trying to separate them and write each one to its own docx file. I managed to separate them using a regex, but when I try to write them to docx files I get this error: ValueError: All strings must be XML compatible: Unicode or ASCII, no NULL bytes or control characters.

My code is as follows:

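import os
import re
import time

import docx
import textract

my_path = "/path/to/pdf"

# textract returns bytes, so decode them to str before matching
newpath = textract.process(my_path)
newpath2 = newpath.decode("UTF-8")

# each article sits between "<n> words" and "Document <id>"
result = re.findall(r'\d+ words(.*?)Document \w+', newpath2, re.DOTALL)

save_path = "/path/to/write/docx/files/"

for each in result:
    # use the current timestamp as the file name
    timestamp = str(time.time())
    finalpath = os.path.join(save_path, timestamp)
    finalpath2 = finalpath + ".docx"
    mydoc = docx.Document()
    mydoc.add_paragraph(each)  # the ValueError is raised here
    mydoc.save(finalpath2)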

Solution

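You can strip the null bytes and control characters from each string before adding it to the document:

.add_paragraph(remove_control_characters(each.replace('\x00','')))

The remove_control_characters function can be borrowed from the thread "Removing control characters from a string in python". It drops every character whose Unicode category starts with "C", which already covers '\x00', so the extra replace call is harmless but not strictly needed.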
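If you want to see which characters are causing the error before stripping them, a small diagnostic like the one below may help. It is only a sketch: find_control_characters is a hypothetical helper name, not part of textract or python-docx, and the sample string is made up (text extracted from PDFs often contains form feeds such as '\x0c', but your file may differ).

```python
import unicodedata
from collections import Counter


def find_control_characters(text):
    """Count the characters whose Unicode category starts with "C"."""
    return Counter(ch for ch in text if unicodedata.category(ch)[0] == "C")


# Made-up sample standing in for the decoded textract output
sample = "First page\x0cSecond\x00 page\n"

for ch, count in find_control_characters(sample).items():
    # repr() makes the invisible characters readable, e.g. '\x0c'
    print(repr(ch), unicodedata.category(ch), count)
```

In the real script you would pass newpath2 (or each article in result) instead of sample.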

Code snippet:

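import os
import re
import time
import unicodedata

import docx
import textract

def remove_control_characters(s):
    # drop every character whose Unicode category starts with "C"
    return "".join(ch for ch in s if unicodedata.category(ch)[0]!="C")

my_path = "/path/to/pdf"

# textract returns bytes, so decode them to str before matching
newpath = textract.process(my_path)
newpath2 = newpath.decode("UTF-8")

# each article sits between "<n> words" and "Document <id>"
result = re.findall(r'\d+ words(.*?)Document \w+', newpath2, re.DOTALL)

save_path = "/path/to/write/docx/files/"

for each in result:
    # use the current timestamp as the file name
    timestamp = str(time.time())
    finalpath = os.path.join(save_path, timestamp)
    finalpath2 = finalpath + ".docx"
    mydoc = docx.Document()
    # sanitize the text so it is XML compatible before adding it
    mydoc.add_paragraph(remove_control_characters(each.replace('\x00','')))
    mydoc.save(finalpath2)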
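
One side effect worth knowing: newlines and tabs are also in category "C", so remove_control_characters collapses each article into a single unbroken run of text. Tab, line feed and carriage return are allowed in XML 1.0, so they do not need to be removed to avoid the ValueError. Below is a possible variant that keeps them; the name remove_control_characters_keep_whitespace and the KEEP set are my own choices for illustration, not something from the linked thread.

```python
import unicodedata

# Whitespace control characters that XML 1.0 permits; kept so line breaks survive
KEEP = {"\t", "\n", "\r"}


def remove_control_characters_keep_whitespace(s):
    """Drop category "C" characters except tab, line feed and carriage return."""
    return "".join(
        ch for ch in s
        if ch in KEEP or unicodedata.category(ch)[0] != "C"
    )


# Made-up sample: the NUL byte and form feed go, the newline and tab stay
cleaned = remove_control_characters_keep_whitespace("a\x00b\x0cc\nd\te")
print(repr(cleaned))
```

To try it, swap it in for remove_control_characters in the add_paragraph call. As far as I know, python-docx turns a newline inside the text into a line break within the same paragraph, so if you want real paragraphs you could split the cleaned text on blank lines and call add_paragraph once per piece.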