£ displaying in urllib2 and Beautiful Soup

I'm trying to write a small web scraper in python, and I think I've run into an encoding issue. I'm trying to scrape http://www.resident-music.com/tickets (specifically the table on the page) - a row might look something like this -

    <tr>
        <td style="width:64.9%;height:11px;">
         <p><strong>the great escape 2017&nbsp; local early bird tickets, selling fast</strong></p>
        </td>
        <td style="width:13.1%;height:11px;">
         <p><strong>18<sup>th</sup>&ndash; 20<sup>th</sup> may</strong></p>
        </td>
        <td style="width:15.42%;height:11px;">
         <p><strong>various</strong></p>
        </td>
        <td style="width:6.58%;height:11px;">
         <p><strong>&pound;55.00</strong></p>
        </td>
       </tr>

I'm essentially trying to replace the £55.00 with £55, and any other 'non-text' nasties.

I've tried a few different encoding things you can go with beautifulsoup, and urllib2 - to no avail, I think I'm just doing it all wrong.

Thanks

Solution

You want to unescape the html which you can do using html.unescape in python3:

In [14]: from html import unescape

In [15]: h = """<tr>
   ....:         <td style="width:64.9%;height:11px;">
   ....:          <p><strong>the great escape 2017&nbsp; local early bird tickets, selling fast</strong></p>
   ....:         </td>
   ....:         <td style="width:13.1%;height:11px;">
   ....:          <p><strong>18<sup>th</sup>&ndash; 20<sup>th</sup> may</strong></p>
   ....:         </td>
   ....:         <td style="width:15.42%;height:11px;">
   ....:          <p><strong>various</strong></p>
   ....:         </td>
   ....:         <td style="width:6.58%;height:11px;">
   ....:          <p><strong>&pound;55.00</strong></p>
   ....:         </td>
   ....:        </tr>"""

In [16]: 

In [16]: print(unescape(h))
<tr>
        <td style="width:64.9%;height:11px;">
         <p><strong>the great escape 2017  local early bird tickets, selling fast</strong></p>
        </td>
        <td style="width:13.1%;height:11px;">
         <p><strong>18<sup>th</sup>– 20<sup>th</sup> may</strong></p>
        </td>
        <td style="width:15.42%;height:11px;">
         <p><strong>various</strong></p>
        </td>
        <td style="width:6.58%;height:11px;">
         <p><strong>£55.00</strong></p>
        </td>
       </tr>

For python2 use:

In [6]: from html.parser import HTMLParser

In [7]: unescape = HTMLParser().unescape  

In [8]: print(unescape(h))
<tr>
        <td style="width:64.9%;height:11px;">
         <p><strong>the great escape 2017  local early bird tickets, selling fast</strong></p>
        </td>
        <td style="width:13.1%;height:11px;">
         <p><strong>18<sup>th</sup>– 20<sup>th</sup> may</strong></p>
        </td>
        <td style="width:15.42%;height:11px;">
         <p><strong>various</strong></p>
        </td>
        <td style="width:6.58%;height:11px;">
         <p><strong>£55.00</strong></p>
        </td>

You can see both correctly unescape all entities not just the pound sign.

&pound; displaying in urllib2 and Beautiful Soup

£ displaying in urllib2 and Beautiful Soup