Search code examples
pythonunicodepython-docx

Writing Unicode to .docx file


Is there any way to write unicode characters to docx files? I tried python-docx but, it's giving me TypeError.

Traceback

Traceback (most recent call last):

File "< ipython-input-1-ba89c735995d >", line 1, in runfile('H:/Python/Practice/new/download.py', wdir='H:/Python/Practice/new')

File "C:\Program Files\Anaconda3\lib\site-packages\spyder\utils\site\sitecustomize.py", line 866, in runfile execfile(filename, namespace)

File "C:\Program Files\Anaconda3\lib\site-packages\spyder\utils\site\sitecustomize.py", line 102, in execfile exec(compile(f.read(), filename, 'exec'), namespace)

File "H:/Python/Practice/new/download.py", line 37, in document.add_paragraph(story.encode("utf-8"))

File "C:\Program Files\Anaconda3\lib\site-packages\docx\document.py", line 63, in add_paragraph return self._body.add_paragraph(text, style)

File "C:\Program Files\Anaconda3\lib\site-packages\docx\blkcntnr.py", line 36, in add_paragraph paragraph.add_run(text)

File "C:\Program Files\Anaconda3\lib\site-packages\docx\text\paragraph.py", line 37, in add_run run.text = text

File "C:\Program Files\Anaconda3\lib\site-packages\docx\text\run.py", line 163, in text self._r.text = text

File "C:\Program Files\Anaconda3\lib\site-packages\docx\oxml\text\run.py", line 104, in text _RunContentAppender.append_to_run_from_text(self, text)

File "C:\Program Files\Anaconda3\lib\site-packages\docx\oxml\text\run.py", line 134, in append_to_run_from_text appender.add_text(text)

File "C:\Program Files\Anaconda3\lib\site-packages\docx\oxml\text\run.py", line 142, in add_text self.add_char(char)

File "C:\Program Files\Anaconda3\lib\site-packages\docx\oxml\text\run.py", line 156, in add_char elif char in '\r\n':

TypeError: 'in < string >' requires string as left operand, not int

I was trying to scrape a website and write the texts to a MS Word file. Texts are in local language (Bangla). When I print story on console it prints the whole texts perfectly.

Code

import requests
from docx import Document
from bs4 import BeautifulSoup

url = "some url"
response = requests.get(url)
soup = BeautifulSoup(response.text, "lxml")

story = soup.find("div", {"dir": "ltr"}).get_text().replace("<br />", "\n", len(response.text))

title = "title"
document = Document()
document.add_heading(title, 0)
document.add_paragraph(story.encode("utf-8"))
document.save(title + ".docx")

Solution

  • In Python 3, strings (should) automatically support Unicode, so you don't need to do anything special like encoding-to-UTF-8.

    document.add_paragraph(story)
    #                      ^ just do this directly, no need to call `.encode('utf-8')`
    

    While you can write Unicode characters into the document, the default text rendering engine used by the word processor may not support it. You may need to specify a font to ensure the characters won't become □□□□.

    run = document.add_paragraph(story).add_run()
    font = run.font
    font.name = 'Vrinda'