I have an issue trying to get all the text nodes in an HTML document using lxml, but I get a UnicodeEncodeError: 'ascii' codec can't encode character u'\xe9' in position 8995: ordinal not in range(128). However, when I try to find out the encoding of the page with encoding = chardet.detect(response)['encoding'], it says it is utf-8. It seems weird that a single page would involve both utf-8 and ascii. Actually, this:

fromstring(response).text_content().encode('ascii', 'replace')

solves the problem.
Here is my code:
from lxml.html import fromstring
import urllib2
import chardet

request = urllib2.Request(my_url)
request.add_header('User-Agent',
                   'Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 6.0)')
request.add_header("Accept-Language", "en-us")
response = urllib2.urlopen(request).read()

# Detect the page encoding, then print every text node in the document
encoding = chardet.detect(response)['encoding']
print encoding
print fromstring(response).text_content()
Output:
utf-8
UnicodeEncodeError: 'ascii' codec can't encode character u'\xe9' in position 8995: ordinal not in range(128)
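For context, text_content() returns a Python unicode object, and I suspect the error comes from print implicitly encoding it to the output stream's encoding, which here is ASCII. A minimal sketch of that assumed mechanism (the u'caf\xe9' sample string is just an illustration):

# -*- coding: utf-8 -*-
# Printing a unicode string to an ASCII stream makes Python 2 run an
# implicit .encode('ascii'), which is what raises the error above.
text = u'caf\xe9'  # a unicode object, like text_content() returns
try:
    text.encode('ascii')  # the implicit step performed by print
except UnicodeEncodeError as exc:
    print repr(exc)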
What can I do to solve this issue? Keep in mind that I want to do this with a few other pages, so I don't want to handle the encoding on a page-by-page basis.
UPDATE:
Maybe there is something else going on here. When I run this script in the terminal, I get correct output, but when I run it inside Sublime Text, I get the UnicodeEncodeError. Why would that be?
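If it helps, here is a minimal sketch of what I think is going on (assuming Sublime Text's build output is not a real terminal, so sys.stdout.encoding is unset and Python 2 falls back to ASCII):

# -*- coding: utf-8 -*-
import sys
import codecs

# Under a real terminal this usually prints something like UTF-8;
# under Sublime Text's build system it is often None.
print sys.stdout.encoding

# Assumed workaround: wrap stdout so every unicode string printed to it
# is encoded to UTF-8, regardless of the environment.
sys.stdout = codecs.getwriter('utf-8')(sys.stdout)
print u'caf\xe9'  # prints without raising UnicodeEncodeError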
UPDATE2:
It also happens when I write this output to a file. .encode('ascii', 'replace') works, but I'd like a more general solution.
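For the file case, what I mean by a more general solution would be something along these lines, opening the file with an explicit encoding instead of encoding every string by hand (not sure if this is the right approach; the sample HTML is just a stand-in for response):

# -*- coding: utf-8 -*-
import codecs
from lxml.html import fromstring

# Hypothetical page standing in for the downloaded response
response = u'<html><body><p>caf\xe9</p></body></html>'

# The file object encodes every unicode string to UTF-8 on write,
# so no per-string .encode('ascii', 'replace') is needed.
text = fromstring(response).text_content()
with codecs.open('output.txt', 'w', encoding='utf-8') as f:
    f.write(text)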
Regards
Can you try wrapping your string with repr()? This article might help.
print repr(fromstring(response).text_content())
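For example, repr() turns the unicode value into an ASCII-safe literal, so print never has to encode the original characters (quick sketch; u'caf\xe9' is just a sample value):

print repr(u'caf\xe9')  # -> u'caf\xe9'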