Encode in scraping data with Python


I want to scrape the content of websites with Python, for example this text:

Apple’s stock continued to dominate the news over the weekend, with Barron’s placing it on the top of its favorite 2013 stock list.

But printing it gives a garbled result:

Apple âs stock continued to dominate the news over the weekend, with Barronâs placing it on the top of its favorite 2013 stock list.

The "’" character isn't displayed correctly. Here is my code:

    #-*- coding: utf-8 -*-

    import sys
    reload(sys)
    sys.setdefaultencoding('utf-8')
    import urllib
    import lxml.html as HTML

    url = "http://www.forbes.com/sites/panosmourdoukoutas/2012/12/09/apple-tops-barrons- 10-favorite-stocks-for-2013/?partner=yahootix"
    sock = urllib.urlopen(url)
    htmlSource = sock.read()
    sock.close()

    root = HTML.document_fromstring(htmlSource)
    contents = ' '.join([x.strip() for x in root.xpath("//div[@class='body']/descendant::text()")])

    print contents

    f = open('C:/Users/yinyao/Desktop/Python Code/data.txt','w')
    f.write(contents)
    f.close()

However, after setting the default encoding, print no longer works properly. Why? And what should I do? I'm using Windows, where the default encoding is GBK.


Solution

  • First, make sure you have read "The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)".

  • Second, always use unicode internally. Decode early, encode late: when you scrape a website, decode its content to unicode and process it as unicode throughout your script. Otherwise your code will crash at unpredictable points, for example when it encounters an unexpected character in a comment on some Chinese web page. Encode (preferably as "utf-8") only when you pass the text on somewhere, e.g. to a writeable stream.

  • Third, use BeautifulSoup 4
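
The "decode early, encode late" rule might look roughly like the sketch below. It assumes the page is served as UTF-8. The `raw` bytes are a hypothetical stand-in for what `sock.read()` returns, and `decode_page` and `save_text` are illustrative helper names, not part of the question's code. It uses only the standard library, and `io.open` is chosen because it accepts an `encoding` argument.

```python
# -*- coding: utf-8 -*-
import io
import os
import tempfile

# Hypothetical stand-in for the bytes returned by sock.read();
# assumption: the server sends UTF-8. \xe2\x80\x99 is the "’" character.
raw = b"Apple\xe2\x80\x99s stock continued to dominate the news"


def decode_page(raw_bytes, encoding="utf-8"):
    """Decode early: turn network bytes into text before any processing."""
    return raw_bytes.decode(encoding)


def save_text(text, path, encoding="utf-8"):
    """Encode late: the file object encodes the text only when writing."""
    with io.open(path, "w", encoding=encoding) as f:
        f.write(text)


# Correct: decode with the encoding the page actually uses.
text = decode_page(raw)

# For comparison: decoding the same bytes with the wrong codec
# produces the stray "â" seen in the question's output.
wrong = raw.decode("latin-1")

# Write to a temporary file (the path is illustrative only).
path = os.path.join(tempfile.mkdtemp(), "data.txt")
save_text(text, path)
```

In the question's script, this would correspond to decoding `htmlSource` before handing it to the parser and opening the output file with an explicit encoding, rather than changing the interpreter's default encoding.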