Tags: python, html, beautifulsoup, href

Python HTML parsing using the BeautifulSoup framework


I'm using the Beautiful Soup framework to retrieve a link (the href in the HTML content below):

         <div class="store">
               <label>Store</label>
                 <span>
                   <a title="Open in Google Play" href="https://play.google.com/store/apps/details?id=com.opera.mini.android" target="_blank">
                        <!-- ><span class="ui-icon app-store-gp"></span> -->
                        Google Play
                   </a><i class="icon-external-link"></i>
                 </span>
             </div>

I used the following code to retrieve this in python:

 pageFile = urllib.urlopen("http://appannie.com/apps/google-play/app/com.opera.mini.android")
 pageHtml = pageFile.read()
 pageFile.close()
 print pageHtml
 soup = BeautifulSoup("".join(pageHtml))
 item = soup.find("a", {"title": "Open in Google Play"})

 print item

The last print outputs None. Any help would be really appreciated.

I printed out the html page and the output was as follows:

  <html>
  <head><title>503 Service Temporarily Unavailable</title></head>
  <body bgcolor="white">
  <center><h1>503 Service Temporarily Unavailable</h1></center>
  <hr><center>nginx</center>
  </body>
  </html>

The page loads fine in a browser, though.


Solution

  • item = soup.find("a", {"title":"Open in Google Play"})
    

    You were initially searching for a "span" with the title "Open in Google Play"; however, the element you're looking for is an "a" (a link).

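    For comparison, here is a stdlib-only sketch (Python 3's `html.parser`, no third-party modules) that pulls the same href out of the snippet from the question; it illustrates that the title/href pair sits on the `<a>` tag, not on the `<span>`:

    ```python
    from html.parser import HTMLParser  # Python 3 standard library

    # The HTML fragment from the question, trimmed to the relevant part.
    HTML = '''
    <div class="store">
      <label>Store</label>
      <span>
        <a title="Open in Google Play"
           href="https://play.google.com/store/apps/details?id=com.opera.mini.android"
           target="_blank">Google Play</a>
      </span>
    </div>
    '''

    class HrefCollector(HTMLParser):
        """Collects href values from <a> tags with a matching title."""
        def __init__(self, title):
            super().__init__()
            self.title = title
            self.hrefs = []

        def handle_starttag(self, tag, attrs):
            attrs = dict(attrs)
            if tag == "a" and attrs.get("title") == self.title:
                self.hrefs.append(attrs.get("href"))

    parser = HrefCollector("Open in Google Play")
    parser.feed(HTML)
    print(parser.hrefs[0])
    # -> https://play.google.com/store/apps/details?id=com.opera.mini.android
    ```

    BeautifulSoup's `soup.find("a", {"title": ...})` does the same matching with far less code; the point here is only where the attributes live.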
    Edit: since it appears that the server returns a 503 error, try setting a common user-agent (not tested, it may not work at all; note the added import):

    import urllib2

    request = urllib2.Request(sampleURL, None,
                              {"User-Agent": "Mozilla/5.0 (Windows NT 6.1; WOW64; rv:25.0) Gecko/20100101 Firefox/25.0"})
    soup = BeautifulSoup(urllib2.urlopen(request).read())
    item = soup.find("a", {"title": "Open in Google Play"})
    print item
    

    Also, I removed the useless "".join(pageHtml): urlopen(...).read() already returns a string, so there is no need for the join.
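
    On Python 3, the same approach uses urllib.request instead of urllib2. A minimal sketch (the URL is taken from the question, the User-Agent string is the same placeholder as above, and the request is deliberately not sent here, so it stays offline):

    ```python
    import urllib.request  # Python 3 replacement for urllib2

    URL = "http://appannie.com/apps/google-play/app/com.opera.mini.android"
    UA = ("Mozilla/5.0 (Windows NT 6.1; WOW64; rv:25.0) "
          "Gecko/20100101 Firefox/25.0")

    # Build the request with a browser-like User-Agent header attached.
    req = urllib.request.Request(URL, headers={"User-Agent": UA})

    # urllib.request normalizes header names to capitalized form,
    # so the stored key is "User-agent".
    print(req.get_header("User-agent"))

    # To actually fetch and parse the page, you would then do:
    #   soup = BeautifulSoup(urllib.request.urlopen(req).read())
    #   item = soup.find("a", {"title": "Open in Google Play"})
    ```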