Can't open Unicode URL with Python


Using Python 2.5.2 on Debian Linux, I'm trying to get the content from a Spanish URL that contains the Spanish character 'í':

import urllib
url = u'http://mydomain.es/índice.html'
content = urllib.urlopen(url).read()

I'm getting this error:

UnicodeEncodeError: 'ascii' codec can't encode character u'\xe1' in position 8: ordinal not in range(128)

Before passing the URL to urllib, I've tried this:

url = urllib.quote(url)

and this:

url = url.encode('UTF-8')

but they didn't work.

Can you tell me what I am doing wrong?


Solution

  • Per the applicable standard, RFC 1738, URLs can only contain ASCII characters. Good explanation here, and I quote:

    "...Only alphanumerics [0-9a-zA-Z], the special characters "$-_.+!*'()," [not including the quotes - ed], and reserved characters used for their reserved purposes may be used unencoded within a URL."

    As the URLs I've given explain, this probably means you'll have to replace that "lowercase i with acute accent" with "%ED" (its Latin-1 byte), or with "%C3%AD" if the server expects the path in UTF-8.
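
A minimal sketch of that replacement in Python, assuming the server accepts a UTF-8 percent-encoded path (pass 'latin-1' to get the single-byte form instead). The helper name to_ascii_url is hypothetical, not part of urllib, and the imports fall back so the same code runs on Python 2 and Python 3. Only the path is escaped; calling urllib.quote on the whole URL, as tried in the question, would also escape the ':' after the scheme.

```python
try:
    # Python 2 module layout
    from urllib import quote
    from urlparse import urlsplit, urlunsplit
except ImportError:
    # Python 3 module layout
    from urllib.parse import quote, urlsplit, urlunsplit


def to_ascii_url(url, encoding='utf-8'):
    """Hypothetical helper: percent-encode non-ASCII characters in the URL path."""
    parts = urlsplit(url)
    # Turn the path into bytes first, then escape every byte that is not
    # URL-safe. '/' and '%' stay untouched so existing escapes survive.
    path = quote(parts.path.encode(encoding), safe='/%')
    return urlunsplit((parts.scheme, parts.netloc, path,
                       parts.query, parts.fragment))


url = u'http://mydomain.es/\xedndice.html'   # \xed is 'í'
print(to_ascii_url(url))             # http://mydomain.es/%C3%ADndice.html
print(to_ascii_url(url, 'latin-1'))  # http://mydomain.es/%EDndice.html
```

The returned string is plain ASCII, so it can be passed to urllib.urlopen without triggering the UnicodeEncodeError. Which of the two forms the server resolves depends on how it decodes paths, so that part is an assumption to check. A non-ASCII host name would need IDNA encoding, which this sketch does not attempt.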