Tags: python, encoding, beautifulsoup, urllib2

£ displaying in urllib2 and Beautiful Soup


I'm trying to write a small web scraper in Python, and I think I've run into an encoding issue. I'm trying to scrape http://www.resident-music.com/tickets (specifically the table on the page). A row might look something like this:

    <tr>
        <td style="width:64.9%;height:11px;">
         <p><strong>the great escape 2017&nbsp; local early bird tickets, selling fast</strong></p>
        </td>
        <td style="width:13.1%;height:11px;">
         <p><strong>18<sup>th</sup>&ndash; 20<sup>th</sup> may</strong></p>
        </td>
        <td style="width:15.42%;height:11px;">
         <p><strong>various</strong></p>
        </td>
        <td style="width:6.58%;height:11px;">
         <p><strong>&pound;55.00</strong></p>
        </td>
       </tr>

I'm essentially trying to replace the &pound;55.00 with £55, and any other 'non-text' nasties.

I've tried a few different encoding things you can do with BeautifulSoup and urllib2, to no avail. I think I'm just doing it all wrong.

Thanks


Solution

  • You want to unescape the HTML, which you can do using html.unescape in Python 3 (3.4 and later):

    In [14]: from html import unescape
    
    In [15]: h = """<tr>
       ....:         <td style="width:64.9%;height:11px;">
       ....:          <p><strong>the great escape 2017&nbsp; local early bird tickets, selling fast</strong></p>
       ....:         </td>
       ....:         <td style="width:13.1%;height:11px;">
       ....:          <p><strong>18<sup>th</sup>&ndash; 20<sup>th</sup> may</strong></p>
       ....:         </td>
       ....:         <td style="width:15.42%;height:11px;">
       ....:          <p><strong>various</strong></p>
       ....:         </td>
       ....:         <td style="width:6.58%;height:11px;">
       ....:          <p><strong>&pound;55.00</strong></p>
       ....:         </td>
       ....:        </tr>"""
    
    In [16]: print(unescape(h))
    <tr>
            <td style="width:64.9%;height:11px;">
             <p><strong>the great escape 2017  local early bird tickets, selling fast</strong></p>
            </td>
            <td style="width:13.1%;height:11px;">
             <p><strong>18<sup>th</sup>– 20<sup>th</sup> may</strong></p>
            </td>
            <td style="width:15.42%;height:11px;">
             <p><strong>various</strong></p>
            </td>
            <td style="width:6.58%;height:11px;">
             <p><strong>£55.00</strong></p>
            </td>
           </tr>
    

    For Python 2 use:

    In [6]: from HTMLParser import HTMLParser
    
    In [7]: unescape = HTMLParser().unescape
    
    In [8]: print(unescape(h))
    <tr>
            <td style="width:64.9%;height:11px;">
             <p><strong>the great escape 2017  local early bird tickets, selling fast</strong></p>
            </td>
            <td style="width:13.1%;height:11px;">
             <p><strong>18<sup>th</sup>– 20<sup>th</sup> may</strong></p>
            </td>
            <td style="width:15.42%;height:11px;">
             <p><strong>various</strong></p>
            </td>
            <td style="width:6.58%;height:11px;">
             <p><strong>£55.00</strong></p>
            </td>
           </tr>
    

    You can see that both correctly unescape all of the entities, not just the pound sign.
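
If the scraper has to run under either version, the two imports can be combined with a fallback. This is only a minimal sketch: the helper name `clean_cell` is hypothetical, and swapping the non-breaking space for a plain space is an assumption about what counts as a 'nasty' here. Note that `&nbsp;` unescapes to U+00A0, which prints like a space but is a different character.

```python
# -*- coding: utf-8 -*-
try:
    from html import unescape  # Python 3.4+
except ImportError:
    # Python 2 fallback: the module is called HTMLParser there
    from HTMLParser import HTMLParser
    unescape = HTMLParser().unescape


def clean_cell(raw):
    """Hypothetical helper: decode HTML entities in one cell's text."""
    text = unescape(raw)
    # Assumption: a non-breaking space (U+00A0) should become a plain space
    return text.replace(u"\xa0", u" ")


price = clean_cell(u"&pound;55.00")      # pound sign followed by 55.00
title = clean_cell(u"2017&nbsp; local")  # two ordinary spaces after 2017
```

The helper works on any string, so it can be applied to the whole page or to each cell's text after parsing.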