I'm trying to write a small web scraper in python, and I think I've run into an encoding issue. I'm trying to scrape http://www.resident-music.com/tickets (specifically the table on the page) - a row might look something like this -
<tr>
<td style="width:64.9%;height:11px;">
<p><strong>the great escape 2017 local early bird tickets, selling fast</strong></p>
</td>
<td style="width:13.1%;height:11px;">
<p><strong>18<sup>th</sup>– 20<sup>th</sup> may</strong></p>
</td>
<td style="width:15.42%;height:11px;">
<p><strong>various</strong></p>
</td>
<td style="width:6.58%;height:11px;">
<p><strong>£55.00</strong></p>
</td>
</tr>
I'm essentially trying to replace the £55.00
with £55, and any other 'non-text' nasties.
I've tried a few different encoding things you can go with beautifulsoup, and urllib2 - to no avail, I think I'm just doing it all wrong.
Thanks
You want to unescape the html which you can do using html.unescape in python3:
In [14]: from html import unescape
In [15]: h = """<tr>
....: <td style="width:64.9%;height:11px;">
....: <p><strong>the great escape 2017 local early bird tickets, selling fast</strong></p>
....: </td>
....: <td style="width:13.1%;height:11px;">
....: <p><strong>18<sup>th</sup>– 20<sup>th</sup> may</strong></p>
....: </td>
....: <td style="width:15.42%;height:11px;">
....: <p><strong>various</strong></p>
....: </td>
....: <td style="width:6.58%;height:11px;">
....: <p><strong>£55.00</strong></p>
....: </td>
....: </tr>"""
In [16]:
In [16]: print(unescape(h))
<tr>
<td style="width:64.9%;height:11px;">
<p><strong>the great escape 2017 local early bird tickets, selling fast</strong></p>
</td>
<td style="width:13.1%;height:11px;">
<p><strong>18<sup>th</sup>– 20<sup>th</sup> may</strong></p>
</td>
<td style="width:15.42%;height:11px;">
<p><strong>various</strong></p>
</td>
<td style="width:6.58%;height:11px;">
<p><strong>£55.00</strong></p>
</td>
</tr>
For python2 use:
In [6]: from html.parser import HTMLParser
In [7]: unescape = HTMLParser().unescape
In [8]: print(unescape(h))
<tr>
<td style="width:64.9%;height:11px;">
<p><strong>the great escape 2017 local early bird tickets, selling fast</strong></p>
</td>
<td style="width:13.1%;height:11px;">
<p><strong>18<sup>th</sup>– 20<sup>th</sup> may</strong></p>
</td>
<td style="width:15.42%;height:11px;">
<p><strong>various</strong></p>
</td>
<td style="width:6.58%;height:11px;">
<p><strong>£55.00</strong></p>
</td>
You can see both correctly unescape all entities not just the pound sign.