python, python-2.7, docx

Converting Docx to pure text


I am trying to convert docx files to text but keep getting an error. I am using Python 2.7.

import docx

def getText(filename):
    doc = docx.Document(filename)
    fullText = []
    for para in doc.paragraphs:
        fullText.append(para.text)
    return '\n'.join(fullText)

Traceback:

return codecs.charmap_encode(input,errors,encoding_map)
UnicodeEncodeError: 'charmap' codec can't encode character u'\u2019' in position 764: character maps to <undefined>

Solution

  • It looks like it doesn't like \u2019 (the right single quotation mark), and it will probably reject \u2018 (the left single quotation mark) as well. I'd encode the unicode data to ASCII and ignore anything that can't be converted, which removes those characters:

    import docx
    
    def getText(filename):
        doc = docx.Document(filename)
        fullText = []
        for para in doc.paragraphs:
            # Drop any character that has no ASCII equivalent (e.g. curly quotes)
            txt = para.text.encode('ascii', 'ignore')
            fullText.append(txt)
        return '\n'.join(fullText)
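
One side effect of 'ignore' is that curly apostrophes vanish entirely, so "don\u2019t" comes out as "dont". Below is a minimal sketch of an alternative, assuming you would rather map common typographic punctuation to plain ASCII first and only then drop whatever is left. The helper name `to_plain_ascii` and the mapping table are illustrative assumptions, not part of python-docx, and the table only covers a handful of characters.

```python
# Hypothetical mapping (not part of python-docx): code point -> ASCII stand-in.
SMART_PUNCTUATION = {
    0x2018: u"'",   # left single quotation mark
    0x2019: u"'",   # right single quotation mark
    0x201C: u'"',   # left double quotation mark
    0x201D: u'"',   # right double quotation mark
    0x2013: u"-",   # en dash
    0x2014: u"-",   # em dash
}


def to_plain_ascii(text):
    """Swap typographic punctuation for ASCII, then drop other non-ASCII."""
    # translate() on a unicode string accepts a dict keyed by code point
    text = text.translate(SMART_PUNCTUATION)
    # Anything still outside ASCII is silently removed, as in the answer above
    return text.encode('ascii', 'ignore').decode('ascii')
```

In the loop above, this would replace the encode call: `fullText.append(to_plain_ascii(para.text))`. The function returns unicode rather than a byte string, so the result still joins cleanly with '\n' and can be written or printed with any codec.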