Tags: python, character-encoding, docx, python-docx

Reading accents with python-docx


I want to get the plain text of some docx files using python-docx, but I'm struggling with the accents since the text is written in Spanish.

I'm using this answer to read the text:

import docx

def getText(filename):
  # Python 2: collect each paragraph's text, encoded as UTF-8 bytes
  doc = docx.Document(filename)
  fullText = []
  for para in doc.paragraphs:
      fullText.append(para.text.encode('utf-8'))
  return '\n'.join(fullText)

Which returns things like this:

n\xc3\xbamero //should be número

Is there a way I can get the text with the correct accents?

When I try to write this text to a file using this:

file = open("/mnt/c/Users/lulas/Desktop/inSpanish/txt/course1.txt","w")
file.write(text)

I get this error:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'ascii' codec can't encode character u'\xed' in position 27: ordinal not in range(128)

And it is due to how the accents are read/encoded.


Solution

  • There is no text but encoded text.

    You are creating a text file. A text file is written with a character encoding. The error says that the text you are writing includes a character (here u'\xed', which is í) that the encoding in use, ASCII, doesn't support.

    So you must either choose a different encoding or not write those characters. Keep in mind that (1) the reader must know which encoding the file uses, so that must be communicated or agreed upon, and (2) the original characters might be highly valued, so dropping or replacing them could be a poor choice.

    Since the source file (docx) uses the Unicode character set, a Unicode encoding might be the optimal choice. For storing and streaming Unicode, UTF-8 is the most common encoding. So,

    # Python 3. On Python 2, use io.open, which accepts the same encoding argument.
    with open("/mnt/c/Users/lulas/Desktop/inSpanish/txt/course1.txt", "w", encoding="utf-8") as file:
        file.write(text)
    

    I don't think the problem is with reading. n\xc3\xbamero is how número looks when it is encoded in UTF-8 and the resulting bytes are displayed with escape sequences. Whatever is showing you that is just trying to be "helpful".
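
To make the points above concrete, here is a minimal Python 3 sketch. It is only an illustration: the sample string, the variable names, and the temporary file path are assumptions standing in for the text and path from the question. It shows that the escaped form is just the UTF-8 bytes of the same word, that ASCII cannot represent the accent, and that writing with an explicit encoding round-trips cleanly.

```python
import os
import tempfile

# Sample string standing in for the text extracted from the docx (assumption).
text = "número"

# Encoding to UTF-8 yields the bytes whose escaped form appears in the question.
encoded = text.encode("utf-8")
print(repr(encoded))  # b'n\xc3\xbamero'

# Decoding those bytes with the same encoding restores the original text.
decoded = encoded.decode("utf-8")

# ASCII has no code for "ú", so encoding with it fails, as in the traceback.
try:
    text.encode("ascii")
    ascii_failed = False
except UnicodeEncodeError:
    ascii_failed = True

# Write and read back with an explicit encoding. A temporary path is used here
# in place of the path from the question.
path = os.path.join(tempfile.mkdtemp(), "course1.txt")
with open(path, "w", encoding="utf-8") as f:
    f.write(text)
with open(path, "r", encoding="utf-8") as f:
    round_trip = f.read()
```

If the accents must be dropped or replaced instead, `str.encode` also accepts an `errors` argument such as `"replace"` or `"ignore"`, though as noted above that loses information.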