python-2.7 beautifulsoup url-encoding

Read multilingual strings from HTML with Python 2.7


I am new to Python 2.7 and I am trying to extract some information from HTML files. More specifically, I want to read text that contains multilingual content. Here is my script, which I hope makes things clearer.

import urllib2
import BeautifulSoup

url = 'http://www.bbc.co.uk/zhongwen/simp/'

page = urllib2.urlopen(url).read().decode("utf-8")
dom = BeautifulSoup.BeautifulSoup(page)
data = dom.findAll('meta', {'name' : 'keywords'})

print data[0]['content'].encode("utf-8")

The result I get is:

BBCϊ╕φόΨΘύ╜ΣΎ╝Νϊ╕╗ώκ╡Ύ╝Νbbcchinese.com, email news, newsletter, subscription, full text

The problem is in the first part of the string, which should be Chinese text. Is there any way to print exactly what I am reading? Also, is there any way to find out which encoding each page uses?

PS: The site was chosen at random; it is simply representative of the problem I am encountering.

Thank you in advance!


Solution

  • The problem is with the terminal you are printing to, not with the script. The script works fine, and if you write the data to a file you will get the correct text.

    Example:

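    import urllib2
    from bs4 import BeautifulSoup
    
    url = 'http://www.bbc.co.uk/zhongwen/simp/'
    
    # Decode the raw bytes to unicode before parsing.
    page = urllib2.urlopen(url).read().decode("utf-8")
    # Name the parser explicitly so bs4 does not have to guess one.
    dom = BeautifulSoup(page, "html.parser")
    data = dom.findAll('meta', {'name' : 'keywords'})
    
    # Write UTF-8 bytes to a file instead of printing to the terminal.
    with open("test.txt", "wb") as myfile:
        myfile.write(data[0]['content'].encode("utf-8"))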
    

    test.txt:

    BBC中文网,主页,bbcchinese.com, email news, newsletter, subscription, full text  
    

    Which OS and terminal are you using?
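
As a rough check of the terminal explanation, the garbled line in the question can be reproduced offline. The sketch below assumes the console was using code page 737 (the Greek DOS code page). The question never says so; this is only a guess based on the Greek letters and box-drawing characters in the output. The helper name `as_seen_on_console` is made up for illustration, and the fullwidth commas (U+FF0C) in the sample text are likewise inferred from the garbled bytes. The code is written to run on both Python 2.7 and Python 3.

```python
# -*- coding: utf-8 -*-

# Assumption: the console interprets bytes as code page 737 (Greek DOS).
# The question does not state which terminal or code page was in use.
ASSUMED_CONSOLE_ENCODING = "cp737"


def as_seen_on_console(text, console_encoding=ASSUMED_CONSOLE_ENCODING):
    """Return what a console would display if it received the UTF-8
    bytes of `text` but decoded them with `console_encoding`."""
    utf8_bytes = text.encode("utf-8")
    return utf8_bytes.decode(console_encoding)


# The Chinese prefix of the keywords string, written with escapes so the
# source file stays ASCII: BBC + "zhong wen wang", "zhu ye", each followed
# by a fullwidth comma.
original = u"BBC\u4e2d\u6587\u7f51\uff0c\u4e3b\u9875\uff0c"

# Under the cp737 assumption this equals the garbled prefix that the
# question shows before "bbcchinese.com".
garbled = as_seen_on_console(original)
```

If the guess about the code page is right, the bytes leaving the script are valid UTF-8 and only the console's interpretation of them is wrong, which agrees with the file output shown above.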
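
On the second part of the question (finding the encoding of a page), one common approach is to read the charset the server or the page itself declares. The sketch below is a minimal, standard-library-only illustration: `declared_charset` is a hypothetical helper, it only looks at a `Content-Type` header value and at a `<meta>` tag near the start of the raw bytes, and a declared charset is not guaranteed to match the real encoding.

```python
import re
from email.message import Message

# Matches both <meta charset="..."> and
# <meta http-equiv="Content-Type" content="text/html; charset=...">.
META_CHARSET_RE = re.compile(
    br'<meta[^>]+charset=["\']?([A-Za-z0-9_\-]+)', re.IGNORECASE)


def declared_charset(content_type_header, raw_html=b""):
    """Return the declared charset in lowercase, or None if none is found.

    The HTTP header is checked first, then a <meta> tag in the first
    2048 bytes of the undecoded page.
    """
    msg = Message()
    if content_type_header:
        msg["Content-Type"] = content_type_header
    charset = msg.get_content_charset()
    if charset:
        return charset
    match = META_CHARSET_RE.search(raw_html[:2048])
    if match:
        return match.group(1).decode("ascii").lower()
    return None
```

In Python 2.7 the header value could come from `urllib2.urlopen(url).info().getheader('Content-Type')`, and the result could then replace the hard-coded `"utf-8"` passed to `decode`. When neither source declares a charset the function returns `None`, and the caller has to fall back to a default or to guessing.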