python, encoding, python-unicode, non-latin

Keep non-Latin characters when scraping a page in Python


I have a program that scrapes a page, parses it for any links, then downloads the pages linked to (sounds like a crawler, but it's not) and saves each one in a separate file. The file name used to save is part of the url of the page. So for instance, if I find a link to www.foobar.com/foo, I would download the page and save it in a file entitled foo.xml.

Later, I need to loop through all such files and re-download them, using the file name as the last part of the url. (All pages are from a single site.)

It works well, until I encounter a non-Latin character in a url. The site uses utf-8, so when I download the original page and decode it, it works fine. But when I try to use the decoded url to download the corresponding page, it doesn't work, because, I assume, the encoding is wrong. I've tried using .encode() on the filename to change it back, but it doesn't change anything.

I know this must be very simple and a result of my not understanding encoding issues properly, but I've been cracking my head on it for a long time. I've read Joel Spolsky's introduction to encoding several times, but I still can't quite work out what to do here. Can anyone help me?

Thanks a lot, bsg

Here's some code. I don't get any errors, but when I try to download the page using the page name as part of the URL, I am told that the page doesn't exist. Of course it doesn't: there's no such page as abc/x54.

To clarify: I download the HTML of a page that includes a link to, e.g., www.foobar.com/Mehmet Kenan Dalbaşar, but the link shows up as Mehmet_Kenan_Dalba%C5%9Far. When I try to download the page www.foobar.com/Mehmet_Kenan_Dalba%C5%9Far, the page is blank. How do I keep www.foobar.com/Mehmet Kenan Dalbaşar and return it to the site when I need to?

import os
import urllib
import urllib2

# headers, fullpath and links are defined elsewhere in the program
params = urllib.urlencode({'title': 'Foo', 'action': 'submit'})
req = urllib2.Request(url='foobar.com', data=params, headers=headers)
f = urllib2.urlopen(req)

# charset announced by the server in the Content-Type header
encoding = f.headers.getparam('charset')
temp = f.read().decode(encoding)

# lots of code to parse out the links

for line in links:
    pagename = line.replace('\n', '')
    print pagename

    # make the page name safe to use as a file name
    newpagename = pagename.replace(':', '_').replace('/', '_')
    final = os.path.join(fullpath, newpagename)
    print final
    final = final.encode('utf-8')
    print final

    # only download the page if it hasn't already been downloaded
    if not os.path.exists(final + ".xml"):
        print "doesn't exist"
        save = open(final + ".xml", 'w')
        # note: f was already read above, so as written this writes an empty string
        save.write(f.read())
        save.close()

Solution

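  • A sequence such as '%C5%9F' in a URL is percent-encoding: each %XX stands for one byte, and here the two bytes \xC5\x9F are the UTF-8 encoding of ş. To obtain the URL with the actual characters, call urllib.unquote() on the URL (as a byte string) and decode the result as UTF-8. To go the other way, encode the name as UTF-8 and pass it to urllib.quote(). Calling .encode() alone only produces the raw bytes, not the %XX escapes.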
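
A minimal sketch of that round trip, using the page name from the question. It assumes Python 3, where the same functions live in urllib.parse (quote and unquote) and handle the UTF-8 step themselves; the helper names url_part_to_name and name_to_url_part are made up for illustration and are not part of any library.

```python
from urllib.parse import quote, unquote


def url_part_to_name(url_part):
    """Turn a percent-encoded URL segment into readable Unicode text."""
    return unquote(url_part, encoding="utf-8")


def name_to_url_part(name):
    """Percent-encode a Unicode name so it can be put back into a URL."""
    # safe="" also escapes "/" so the name stays a single path segment.
    return quote(name, safe="", encoding="utf-8")


link = "Mehmet_Kenan_Dalba%C5%9Far"  # as it appears in the scraped HTML
name = url_part_to_name(link)        # readable form, usable as a file name
print(name)

# Plain .encode() only yields the raw UTF-8 bytes, not the %XX escapes
# that a URL needs.
print(name.encode("utf-8"))

# Re-encoding restores the form found in the original link.
print(name_to_url_part(name))
```

With this approach the file can be saved under the decoded name, and the URL can be rebuilt later by passing that name through name_to_url_part before appending it to the site address.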