I am trying to convert docx files to text but keep getting an error. I am using Python 2.7.
import docx

def getText(filename):
    doc = docx.Document(filename)
    fullText = []
    for para in doc.paragraphs:
        fullText.append(para.text)
    return '\n'.join(fullText)
Traceback:
return codecs.charmap_encode(input,errors,encoding_map)
UnicodeEncodeError: 'charmap' codec can't encode character u'\u2019' in position 764: character maps to <undefined>
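It looks like it doesn't like \u2019 and probably \u2018 either. These are the right and left single quotation marks, respectively. The error is raised when the text is encoded with a charmap codec (typically the console or output file encoding) that has no mapping for them. I'd encode the unicode data to ASCII and ignore anything that can't be converted, which removes them: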
import docx

def getText(filename):
    doc = docx.Document(filename)
    fullText = []
    for para in doc.paragraphs:
        # Drop every character that has no ASCII equivalent (curly quotes included)
        txt = para.text.encode('ascii', 'ignore')
        fullText.append(txt)
    return '\n'.join(fullText)
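One side effect of 'ignore' is that the quotes vanish entirely, so "don’t" comes out as "dont". If you'd rather keep plain apostrophes, a rough sketch is to swap the common typographic characters for ASCII ones before encoding. The PUNCTUATION table and the to_ascii helper below are names I made up for illustration, not part of python-docx, and the table only covers a few characters, so extend it as needed:

```python
# Hypothetical mapping (not part of python-docx): typographic punctuation
# and the ASCII character to substitute for each.
PUNCTUATION = {
    u'\u2018': u"'",   # left single quotation mark
    u'\u2019': u"'",   # right single quotation mark
    u'\u201c': u'"',   # left double quotation mark
    u'\u201d': u'"',   # right double quotation mark
    u'\u2013': u'-',   # en dash
    u'\u2014': u'-',   # em dash
}

def to_ascii(text):
    """Replace known typographic punctuation, then drop anything else non-ASCII."""
    for fancy, plain in PUNCTUATION.items():
        text = text.replace(fancy, plain)
    # Anything still outside ASCII (accented letters, symbols) is discarded.
    return text.encode('ascii', 'ignore')
```

In getText you would then call to_ascii(para.text) instead of para.text.encode('ascii', 'ignore'). Characters that are not in the table are still silently dropped, same as before.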