
How to encode content as HTML with BeautifulSoup in Python


I have tried everything I can think of to encode the page before passing it to BeautifulSoup. However, when I run the script, the output still shows garbled Unicode characters. Can anyone show me how to handle the encoding with BeautifulSoup?

My code:

import httplib
import urllib
import urllib2
from BeautifulSoup import BeautifulSoup
import HTMLParser


headers={
'Host': 'digitalvita.pitt.edu',
'Connection': 'keep-alive',
'Origin': 'https://digitalvita.pitt.edu',
'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_7_4) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/21.0.1180.89 Safari/537.1',
'Content-type': 'application/x-www-form-urlencoded; charset=UTF-8',
'Accept': 'text/javascript, text/html, application/xml, text/xml, */*',
'Referer': 'https://digitalvita.pitt.edu/index.php',
'Accept-Encoding': 'gzip,deflate,sdch',
'Accept-Language': 'en-US,en;q=0.8',
'Accept-Charset': 'ISO-8859-1,utf-8;q=0.7,*;q=0.3',
'Cookie': 'PHPSESSID=lvetilatpgs9okgrntk1nvn595'
}

data={
'action':'search',
'xdata':'<search id="1"><context type="all" /><results><ordering>familyName</ordering><pagesize>100000</pagesize><page>1</page></results><terms><name>d</name><school>All</school></terms></search>',
'request':'search'
}

data = urllib.urlencode(data)
print data
req = urllib2.Request('https://digitalvita.pitt.edu/dispatcher.php', data, headers)
response = urllib2.urlopen(req)
the_page = response.read()

htmlCodes = [
    ['&', '&amp;'],
    ['<', '&lt;'],
    ['>', '&gt;'],
    ['"', '&quot;'],
]
htmlCodesReversed = htmlCodes[:]
htmlCodesReversed.reverse()

def htmlEncode(s, codes=htmlCodes):
    """ Returns the HTML encoded version of the given string. This is useful to
        display a plain ASCII text string on a web page."""
    for code in codes:
        s = s.replace(code[1], code[0])
    return s
s=htmlEncode(the_page,codes=htmlCodes)

h = HTMLParser.HTMLParser()
s=h.unescape(s)

s.encode("utf-8")

soup=BeautifulSoup(s,convertEntities=BeautifulSoup.HTML_ENTITIES)
print soup

A sample of the output looks like this:

¬¨‚Ć&lt;a href="#local" onclick="dvSearch.ToggleInterests(141432);"&gt;&lt;span class="iToggle" id="toggle_141432"&gt;more...&lt;/span&gt;&lt;/a&gt;&lt;/span&gt;&lt;/div&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;&lt;/div&gt;&lt;div&gt;&lt;table width="100%" cellspacing="5" cellpadding="0"&gt;&lt;tr valign="top"&gt;&lt;td&gt;&lt;img width="70" height="70" src="http://digitalvita.pitt.edu/digital-vitaUI.profileimages/na.jpg" /&gt;&lt;/td&gt;&lt;td width="99%"&gt;&lt;div&gt;&lt;span class="name"&gt; Znati, Taieb&lt;/span&gt;&lt;span class="email"&gt; (&lt;a href="mailto:[email protected]"&gt;[email protected]&lt;/a&gt;) &lt;/span&gt;&lt;/div&gt;&lt;div class="professionalPosition"&gt;Computer Science, University of Pittsburgh&lt;/div&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;&lt;/div&gt;&lt;div&gt;&lt;table width="100%" cellspacing="5" cellpadding="0"&gt;&lt;tr valign="top"&gt;&lt;td&gt;&lt;img width="70" height="70" src="http://digitalvita.pitt.edu/digital-vitaUI.profileimages/na.jpg" /&gt;&lt;/td&gt;&lt;td width="99%"&gt;&lt;div&gt;&lt;span class="name"&gt; Zoffer, H&lt;/span&gt;&lt;span class="email"&gt; (&lt;a href="mailto:[email protected]"&gt;[email protected]&lt;/a&gt;) &lt;/span&gt;&lt;/div&gt;&lt;div class="professionalPosition"&gt;"KGSB-Dean, Office of", University of Pittsburgh&lt;/div&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;&lt;/div&gt;&lt;div&gt;&lt;table width="100%" cellspacing="5" cellpadding="0"&gt;&lt;tr valign="top"&gt;&lt;td&gt;&lt;img width="70" height="70" src="http://digitalvita.pitt.edu/digital-vitaUI.profileimages/na.jpg" /&gt;&lt;/td&gt;&lt;td width="99%"&gt;&lt;div&gt;&lt;span class="name"&gt; Zorn, Kristin&lt;/span&gt;&lt;span class="email"&gt; (&lt;a href="mailto:[email protected]"&gt;[email protected]&lt;/a&gt;) &lt;/span&gt;&lt;/div&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;&lt;/div&gt;&lt;div&gt;&lt;table width="100%" cellspacing="5" cellpadding="0"&gt;&lt;tr valign="top"&gt;&lt;td&gt;&lt;img width="70" height="70" 
src="http://digitalvita.pitt.edu/digital-vitaUI.profileimages/na.jpg" /&gt;&lt;/td&gt;&lt;td width="99%"&gt;&lt;div&gt;&lt;span class="name"&gt; Zou, Chunbin&lt;/span&gt;&lt;span class="email"&gt; (&lt;a href="mailto:[email protected]"&gt;[email protected]&lt;/a&gt;) &lt;/span&gt;&lt;/div&gt;&lt;div class="researchInterest"&gt;&lt;b&gt;Research Interests: &lt;/b&gt;fatty liver disease; tyrosine kinase receptor; proteasome endopeptidase complex; phosphatidylcholines; trypanosome; Fas; ubiquitin; pulmonary surfactants; HGF/Met&lt;/div&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;&lt;/div&gt;&lt;div&gt;&lt;table width="100%" cellspacing="5" cellpadding="0"&gt;&lt;tr valign="top"&gt;&lt;td&gt;&lt;img width="70" height="70" src="http://digitalvita.pitt.edu/digital-vitaUI.profileimages/na.jpg" /&gt;&lt;/td&gt;&lt;td width="99%"&gt;&lt;div&gt;&lt;span class="name"&gt; Zou, Xiuying&lt;/span&gt;&lt;span class="email"&gt; (&lt;a href="mailto:[email protected]"&gt;[email protected]&lt;/a&gt;) &lt;/span&gt;&lt;/div&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;&lt;/div&gt;&lt;div&gt;&lt;table width="100%" cellspacing="5" cellpadding="0"&gt;&lt;tr valign="top"&gt;&lt;td&gt;&lt;img width="70" height="70" src="http://digitalvita.pitt.edu/digital-vitaUI.profileimages/na.jpg" /&gt;&lt;/td&gt;&lt;td width="99%"&gt;&lt;div&gt;&lt;span class="name"&gt; Zrust, Marilyn&lt;/span&gt;&lt;span class="email"&gt; (&lt;a href="mailto:[email protected]"&gt;[email protected]&lt;/a&gt;) &lt;/span&gt;&lt;/div&gt;&lt;div class="professionalPosition"&gt;Clinical Instructor, Acute/Tertiary Care, University of Pittsburgh School of Nursing&lt;/div&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;&lt;/div&gt;&lt;div&gt;&lt;table width="100%" cellspacing="5" cellpadding="0"&gt;&lt;tr valign="top"&gt;&lt;td&gt;&lt;img width="70" height="70" src="http://digitalvita.pitt.edu/digital-vitaUI.profileimages/na.jpg" /&gt;&lt;/td&gt;&lt;td width="99%"&gt;&lt;div&gt;&lt;span class="name"&gt; Zubieta, Juan&lt;/span&gt;&lt;span 
class="email"&gt; (&lt;a href="mailto:[email protected]"&gt;[email protected]&lt;/a&gt;) &lt;/span&gt;&lt;/div&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;&lt;/div&gt;&lt;div&gt;&lt;table width="100%" cellspacing="5" cellpadding="0"&gt;&lt;tr valign="top"&gt;&lt;td&gt;&lt;img width="70" height="70" src="http://digitalvita.pitt.edu/digital-vitaUI.profileimages/na.jpg" /&gt;&lt;/td&gt;&lt;td width="99%"&gt;&lt;div&gt;&lt;span class="name"&gt; Zuccoli, Giulio&lt;/span&gt;&lt;span class="email"&gt; (&lt;a href="mailto:[email protected]"&gt;[email protected]&lt;/a&gt;) &lt;/span&gt;&lt;/div&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;&lt;/div&gt;&lt;div&gt;&lt;table width="100%" cellspacing="5" cellpadding="0"&gt;&lt;tr valign="top"&gt;&lt;td&gt;&lt;img width="70" height="70" src="http://digitalvita.pitt.edu/digital-vitaUI.profileimages/na.jpg" /&gt;&lt;/td&gt;&lt;td width="99%"&gt;&lt;div&gt;&lt;span class="name"&gt; Zuckerman, Daniel&lt;/span&gt;&lt;span class="email"&gt; (&lt;a href="mailto:[email protected]"&gt;[email protected]&lt;/a&gt;) &lt;/span&gt;&lt;/div&gt;&lt;div class="professionalPosition"&gt;Computational Biology, University of Pittsburgh&lt;/div&gt;&lt;div class="researchInterest"&gt;&lt;b&gt;Research Interests: &lt;/b&gt;structural biology; stochastic processes; computer simulation; coarse-grained models; protein dynamics and fluctuations; models, theoretical&lt;/div&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;&lt;/div&gt;&lt;div&gt;&lt;table width="100%" cellspacing="5" cellpadding="0"&gt;&lt;tr valign="top"&gt;&lt;td&gt;&lt;img width="70" height="70" src="http://digitalvita.pitt.edu/digital-vitaUI.profileimages/na.jpg" /&gt;&lt;/td&gt;&lt;td width="99%"&gt;&lt;div&gt;&lt;span class="name"&gt; Zuckoff, Allan&lt;/span&gt;&lt;span class="email"&gt; (&lt;a href="mailto:[email protected]"&gt;[email protected]&lt;/a&gt;) &lt;/span&gt;&lt;/div&gt;&lt;div class="professionalPosition"&gt;Psychology, University of 
Pittsburgh&lt;/div&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;&lt;/div&gt;&lt;div&gt;&lt;table width="100%" cellspacing="5" cellpadding="0"&gt;&lt;tr valign="top"&gt;&lt;td&gt;&lt;img width="70" height="70" src="http://digitalvita.pitt.edu/digital-vitaUI.profileimages/na.jpg" /&gt;&lt;/td&gt;&lt;td width="99%"&gt;&lt;div&gt;&lt;span class="name"&gt; Zuckoff, Allan&lt;/span&gt;&lt;span class="email"&gt; (&lt;a href="mailto:[email protected]"&gt;[email protected]&lt;/a&gt;) &lt;/span&gt;&lt;/div&gt;&lt;div class="professionalPosition"&gt;Psychiatry, University of Pittsburgh&lt;/div&gt;&lt;div class="researchInterest"&gt;&lt;b&gt;Research Interests: &lt;/b&gt;psychotherapy; substance-related disorders; motivational interviewing; grief treatment ; diagnosis, dual (psychiatry); treatment adherence; patient compliance; traumatic grief and substance abuse&lt;/div&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;&lt;/div&gt;&lt;div&gt;&lt;table width="100%" cellspacing="5" cellpadding="0"&gt;&lt;tr valign="top"&gt;&lt;td&gt;&lt;img width="70" height="70" src="http://digitalvita.pitt.edu/digital-vitaUI.profileimages/na.jpg" /&gt;&lt;/td&gt;&lt;td width="99%"&gt;&lt;div&gt;&lt;span class="name"&gt; Zukor, Tevya&lt;/span&gt;&lt;span class="email"&gt; (&lt;a href="mailto:[email protected]"&gt;[email protected]&lt;/a&gt;) &lt;/span&gt;&lt;/div&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;&lt;/div&gt;&lt;div&gt;&lt;table width="100%" cellspacing="5" cellpadding="0"&gt;&lt;tr valign="top"&gt;&lt;td&gt;&lt;img width="70" height="70" src="http://digitalvita.pitt.edu/digital-vitaUI.profileimages/na.jpg" /&gt;&lt;/td&gt;&lt;td width="99%"&gt;&lt;div&gt;&lt;span class="name"&gt; Zuley, Margarita&lt;/span&gt;&lt;span class="email"&gt; (&lt;a href="mailto:[email protected]"&gt;[email protected]&lt;/a&gt;) &lt;/span&gt;&lt;/div&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;&lt;/div&gt;&lt;div&gt;&lt;table width="100%" cellspacing="5" cellpadding="0"&gt;&lt;tr valign="top"&gt;&lt;td&gt;&lt;img width="70" 
height="70" src="http://digitalvita.pitt.edu/digital-vitaUI.profileimages/na.jpg" /&gt;&lt;/td&gt;&lt;td width="99%"&gt;&lt;div&gt;&lt;span class="name"&gt; Zunino, Paolo&lt;/span&gt;&lt;span class="email"&gt; (&lt;a href="mailto:[email protected]"&gt;[email protected]&lt;/a&gt;) &lt;/span&gt;&lt;/div&gt;&lt;div class="professionalPosition"&gt;Mech Eng and Materials Sci, University of Pittsburgh&lt;/div&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;&lt;/div&gt;&lt;div&gt;&lt;table width="100%" cellspacing="5" cellpadding="0"&gt;&lt;tr valign="top"&gt;&lt;td&gt;&lt;img width="70" height="70" src="http://digitalvita.pitt.edu/digital-vitaUI.profileimages/na.jpg" /&gt;&lt;/td&gt;&lt;td width="99%"&gt;&lt;div&gt;&lt;span class="name"&gt; Zureikat, Amer&lt;/span&gt;&lt;span class="email"&gt; (&lt;a href="mailto:[email protected]"&gt;[email protected]&lt;/a&gt;) &lt;/span&gt;&lt;/div&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;&lt;/div&gt;&lt;div&gt;&lt;table width="100%" cellspacing="5" cellpadding="0"&gt;&lt;tr valign="top"&gt;&lt;td&gt;&lt;img width="70" height="70" src="http://digitalvita.pitt.edu/digital-vitaUI.profileimages/na.jpg" /&gt;&lt;/td&gt;&lt;td width="99%"&gt;&lt;div&gt;&lt;span class="name"&gt; Zutter, Chad&lt;/span&gt;&lt;span class="email"&gt; (&lt;a href="mailto:[email protected]"&gt;[email protected]&lt;/a&gt;) &lt;/span&gt;&lt;/div&gt;&lt;div class="professionalPosition"&gt;KGSB-Business Admin, University of Pittsburgh&lt;/div&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;&lt;/div&gt;&lt;div&gt;&lt;table width="100%" cellspacing="5" cellpadding="0"&gt;&lt;tr valign="top"&gt;&lt;td&gt;&lt;img width="70" height="70" src="http://digitalvita.pitt.edu/digital-vitaUI.profileimages/na.jpg" /&gt;&lt;/td&gt;&lt;td width="99%"&gt;&lt;div&gt;&lt;span class="name"&gt; Zyczynski, Halina&lt;/span&gt;&lt;span class="email"&gt; (&lt;a href="mailto:[email protected]"&gt;[email protected]&lt;/a&gt;) &lt;/span&gt;&lt;/div&gt;&lt;div class="professionalPosition"&gt;Obstetrics, Gynecology 
and Reproductive Sciences, University of Pittsburgh&lt;/div&gt;&lt;div class="researchInterest"&gt;&lt;b&gt;Research Interests: &lt;/b&gt;pelvic floor reconstruction; rectocele; uterine prolapse; sacralcolpopexy; bladder diseases; colpocleisis; pelic organ prolapse; urinary incontinence&lt;/div&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;&lt;/div&gt;



Solution

  • It looks like the problem is that you're mixing up character sets.

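    The first thing I'd do is change your Accept-Charset so that utf-8 is the only charset you name explicitly, instead of listing ISO-8859-1 first (the wildcard stays as a lower-priority fallback).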

    'Accept-Charset': 'utf-8;q=0.7,*;q=0.3',
    

    Next, the result of response.read() is an 8-bit string, which you have to decode. Since we now know that it's utf-8, you can do this:

    the_page = response.read().decode('utf-8')
    

    With those two changes, when I run your script, the same fragment comes back as:

     … Self Care&lt;/span&gt;
                                                &lt;a href="#local" onclick="dvSearch.ToggleInterests(…
    

    No more garbage Unicode characters.
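To see why the decode step matters, here is a minimal sketch of how UTF-8 bytes turn into garbage when they are read through the wrong codec. The sample string and the choice of Latin-1 as the wrong codec are assumptions for illustration; the actual page may have been misread through a different codec.

```python
# -*- coding: utf-8 -*-
# Hypothetical sample: a no-break space (U+00A0) followed by ASCII text,
# encoded the way a UTF-8 server would send it.
raw = u'\u00a0more...'.encode('utf-8')

# Wrong codec: each of the two bytes of U+00A0 becomes its own character.
wrong = raw.decode('latin-1')

# Right codec: the two bytes are read back as the single original character.
right = raw.decode('utf-8')

print(len(wrong))  # 9: two stray characters plus the 7 ASCII characters
print(len(right))  # 8: one no-break space plus the 7 ASCII characters
```

The extra character in the wrongly decoded string is the kind of stray symbol that shows up in front of the markup in the question's output.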

    Of course this only works because the server is willing to return utf-8. For a more general case, where you have some servers that can only do utf-8 and others that can only do Latin-1, you need to do something a bit more complicated. Leave the Accept-Charset header alone, and then change the read to look at the response headers. Something like this:

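    response = urllib2.urlopen(req)
    # getparam() reads a parameter of the Content-Type header (None if absent);
    # getencoding() would return the Content-Transfer-Encoding, not the charset
    charset = response.info().getparam('charset')
    the_page = response.read().decode(charset)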
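The charset lookup can also be tried without a network connection. The sketch below parses a Content-Type header value with the standard library's email.message.Message and falls back to a default when no charset is declared. The helper name charset_from_content_type and the utf-8 default are assumptions for illustration, not part of the original script.

```python
from email.message import Message


def charset_from_content_type(content_type, default='utf-8'):
    """Return the lowercased charset named in a Content-Type value,
    or `default` (an assumed fallback) when none is declared."""
    msg = Message()
    msg['Content-Type'] = content_type
    return msg.get_content_charset(default)


print(charset_from_content_type('text/html; charset=ISO-8859-1'))
print(charset_from_content_type('text/html'))
```

Passing a default avoids handing None to decode() when the server omits the charset, though the default is still only a guess.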
    

    There are many badly-configured servers that won't actually return a charset, even when they aren't returning pure 7-bit ASCII. In that case, you need to either examine what the server returns and hardcode the right answer, or write code to try to detect the proper charset on the fly. Hopefully you'll never run into this situation…
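For the case where no charset is declared, one rough way to guess on the fly is to try a short list of strict codecs in order and take the first that decodes cleanly. This is only a sketch: the function name decode_with_fallback and the candidate list are assumptions, and a successful decode does not prove the guess is right.

```python
def decode_with_fallback(raw, candidates=('utf-8', 'cp1252')):
    """Decode bytes with the first candidate codec that accepts them.

    Returns a (text, codec_name) pair. Latin-1 is the last resort
    because it maps every byte to a character and so never fails.
    """
    for name in candidates:
        try:
            return raw.decode(name), name
        except UnicodeDecodeError:
            continue
    return raw.decode('latin-1'), 'latin-1'


text, used = decode_with_fallback(u'caf\u00e9'.encode('utf-8'))
print(used)
```

Trying utf-8 first is a deliberate choice: it is strict enough that most non-UTF-8 input with non-ASCII bytes is rejected, whereas single-byte codecs accept almost anything.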