Read multilingual strings from HTML via Python 2.7

I am new to Python 2.7 and I am trying to extract some information from HTML files. More specifically, I want to read some text that contains multilingual content. I give my script below, hoping to make things clearer.

import urllib2
import BeautifulSoup

url = ''

page = urllib2.urlopen(url).read().decode("utf-8")
dom = BeautifulSoup.BeautifulSoup(page)
data = dom.findAll('meta', {'name' : 'keywords'})

print data[0]['content'].encode("utf-8")

The result I am getting is:

BBCϊ╕φόΨΘύ╜ΣΎ╝Νϊ╕╗ώκ╡Ύ╝Ν, email news, newsletter, subscription, full text

The problem is in the first string. Is there any way to print exactly what I am reading? Also, is there any way to find the exact encoding used for each language?

PS: I would like to mention that the site was selected totally at random, as it is representative of the problem I am encountering.

Thank you in advance!


  • The problem is with the terminal where you are printing the result. The script works fine, and if you write the data to a file you will get it correctly.


    import urllib2
    from bs4 import BeautifulSoup
    url = ''
    page = urllib2.urlopen(url).read().decode("utf-8")
    dom = BeautifulSoup(page)
    data = dom.findAll('meta', {'name' : 'keywords'})
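    with open("test.txt", "w") as myfile:
        # Write the UTF-8 bytes straight to the file, bypassing the terminal
        myfile.write(data[0]['content'].encode("utf-8"))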


    BBC中文网,主页,, email news, newsletter, subscription, full text  
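The garbled string in the question looks like what a console produces when it displays UTF-8 bytes using a Greek DOS code page. The sketch below reproduces that effect, assuming the terminal uses cp737 (a guess based on the Greek letters and box-drawing characters in the output, not something the question states). `as_seen_in_terminal` is a hypothetical helper name.

```python
# -*- coding: utf-8 -*-
from __future__ import print_function, unicode_literals


def as_seen_in_terminal(text, terminal_encoding="cp737"):
    """Encode text as UTF-8, then decode those bytes the way a terminal
    using terminal_encoding would (assumed here: cp737, Greek DOS)."""
    return text.encode("utf-8").decode(terminal_encoding)


original = "BBC\u4e2d\u6587\u7f51"  # "BBC中文网"
garbled = as_seen_in_terminal(original)

# Reversing the two steps recovers the original text, which shows
# that the data itself was never damaged, only displayed wrongly.
recovered = garbled.encode("cp737").decode("utf-8")
print(recovered == original)
```

If this matches your setup and you are on Windows, `chcp` in the console shows the active code page.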

    Which OS and terminal are you using?
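As for finding the encoding, two things can be checked from Python: the encoding the terminal claims to use, and the charset the server declares for the page. A minimal sketch follows. `printable` and `charset_from_content_type` are hypothetical helpers, and the header string is a made-up example. In Python 2.7 the real value would typically come from `urllib2.urlopen(url).info().getheader("Content-Type")`.

```python
from __future__ import print_function, unicode_literals

import sys


def printable(text, encoding=None):
    """Return text with every character the target encoding cannot
    represent replaced by '?', instead of raising an error."""
    encoding = encoding or sys.stdout.encoding or "ascii"
    return text.encode(encoding, "replace").decode(encoding)


def charset_from_content_type(header, default=None):
    """Pull the charset parameter out of a Content-Type header value."""
    for part in header.split(";")[1:]:
        name, _, value = part.strip().partition("=")
        if name.lower() == "charset":
            return value.strip("\"' ").lower()
    return default


# What Python believes the terminal can display (may be None when piped).
print("terminal encoding:", sys.stdout.encoding)

# Example header value, assumed for illustration only.
print(charset_from_content_type("text/html; charset=UTF-8"))
```

The declared charset is only what the server claims, so a page can still be served in a different encoding. The terminal encoding is the one that decides whether `print` can show the characters at all.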