I want to get the plain text of some docx files using python-docx, but I'm struggling with the accents since the text is written in Spanish.
I'm using this answer to read the text:
import docx

def getText(filename):
    doc = docx.Document(filename)
    fullText = []
    for para in doc.paragraphs:
        # para.text is a property (not a method); here it is encoded to UTF-8 bytes
        fullText.append(para.text.encode('utf-8'))
    return '\n'.join(fullText)
Which returns things like this:
n\xc3\xbamero //should be número
Is there a way I can get the text with the correct accents?
When I try to write this text to a file using this:
file = open("/mnt/c/Users/lulas/Desktop/inSpanish/txt/course1.txt","w")
file.write(text)
I get this error:
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'ascii' codec can't encode character u'\xed' in position 27: ordinal not in range(128)
And it is due to how the accents are read/encoded.
There is no text but encoded text.
You are creating a text file. A text file is written with a character encoding. The error says the text you are writing to it includes characters that your character encoding doesn't support.
So, you must either choose a different encoding or not write those characters. Keep in mind that 1) the reader must know which encoding the file uses, so it must be communicated and/or agreed upon, and 2) the original characters might be highly valued, so dropping or replacing them could be a poor choice.
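To make the trade-off concrete, here is a minimal sketch of the three outcomes. The sample string is made up for illustration; it is not taken from your documents.

```python
# Hypothetical sample text: "artículo número"
text = u"art\u00edculo n\u00famero"

# Option 0: an encoding that cannot represent the characters fails outright.
try:
    text.encode("ascii")
    ascii_ok = True
except UnicodeEncodeError:
    ascii_ok = False

# Option 1: keep ASCII but replace unsupported characters (lossy).
lossy = text.encode("ascii", "replace")

# Option 2: choose an encoding that supports every character (lossless).
lossless = text.encode("utf-8")

print(ascii_ok)
print(lossy)
print(lossless.decode("utf-8") == text)
```

The lossy version turns each accented letter into `?`, which is usually not what you want for Spanish text.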
Since the source file (docx) uses the Unicode character set, a Unicode encoding might be the optimal choice. For storing and streaming Unicode, UTF-8 is the most common encoding. So,
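import io
# io.open accepts encoding= on Python 2 and 3 (built-in open() only on Python 3); text must be unicode, not already-encoded bytes
with io.open("/mnt/c/Users/lulas/Desktop/inSpanish/txt/course1.txt", "w", encoding="utf-8") as file:
    file.write(text)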
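If you want to convince yourself that the file round-trips, something like the following sketch should do. It assumes a temporary path and a sample string standing in for the output of getText(); it uses io.open because that accepts encoding= on both Python 2 and Python 3.

```python
import io
import os
import tempfile

# Sample text standing in for the text read from the docx ("número").
text = u"n\u00famero"

# Hypothetical temporary path, used here instead of the real output file.
path = os.path.join(tempfile.mkdtemp(), "course1.txt")

# Write the text, letting the file object encode it as UTF-8.
with io.open(path, "w", encoding="utf-8") as f:
    f.write(text)

# Read it back as text, declaring the same encoding.
with io.open(path, "r", encoding="utf-8") as f:
    read_back = f.read()

# Read the raw bytes to see what was actually stored on disk.
with io.open(path, "rb") as f:
    raw = f.read()

print(read_back == text)
print(repr(raw))
```

The key point is that the writer and the reader name the same encoding.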
I don't think the problem is with reading. n\xc3\xbamero is how número is displayed when its UTF-8 encoded bytes are shown with escape sequences. Whatever is showing it to you that way is just trying to be "helpful".
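A quick way to check this yourself, as a small sketch (the variable names are only for illustration):

```python
# "número" as text; \u00fa is the code point for ú.
word = u"n\u00famero"

# Encoding to UTF-8 turns ú into the two bytes 0xC3 0xBA.
encoded = word.encode("utf-8")

# repr() shows non-ASCII bytes as \x escapes; on Python 3 this prints b'n\xc3\xbamero'.
print(repr(encoded))

# Decoding with the same encoding gives back the original text.
decoded = encoded.decode("utf-8")
print(decoded == word)
```

So the accents are intact; what you saw is only the escaped display of the encoded bytes.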