python-2.7 beautifulsoup urlopen

BeautifulSoup fails to read page


I am trying the following:

from urllib2 import urlopen
from BeautifulSoup import BeautifulSoup
url = 'http://propaccess.traviscad.org/clientdb/Property.aspx?prop_id=312669'
soup = BeautifulSoup(urlopen(url).read())
print soup

The print statement above shows the following:

<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN"
        "http://www.w3.org/TR/html4/loose.dtd">
<html>
<head>
<meta http-equiv="Content-type" content="text/html;charset=utf-8" />
<title>Travis Property Search</title>
<style type="text/css">
      body { text-align: center; padding: 150px; }
      h1 { font-size: 50px; }
      body { font: 20px Helvetica, sans-serif; color: #333; }
      #article { display: block; text-align: left; width: 650px; margin: 0 auto; }
      a { color: #dc8100; text-decoration: none; }
      a:hover { color: #333; text-decoration: none; }
    </style>
</head>
<body>
<div id="article">
<h1>Please try again</h1>
<div>
<p>Sorry for the inconvenience but your session has either timed out or the server is busy handling other requests. You may visit us on the the following website for information, otherwise please retry your search again shortly:<br /><br />
<a href="http://www.traviscad.org/">Travis Central Appraisal District Website</a> </p>
<p><b><a href="http://propaccess.traviscad.org/clientdb/?cid=1">Click here to reload the property search to try again</a></b></p>
</div>
</div>
</body>
</html>

However, I can open the URL in a browser on the same computer, so the server is definitely not blocking my IP. What is wrong with my code?


Solution

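  • You need to get some cookies first, then you can visit the URL. The property page appears to expect a session cookie, which the server hands out when you load the search page; without it you get the "Please try again" page.
    Although this can be done with urllib2 and CookieJar, I recommend requests: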

    import requests
    from BeautifulSoup import BeautifulSoup
    
    url1 = 'http://propaccess.traviscad.org/clientdb/?cid=1'
    url = 'http://propaccess.traviscad.org/clientdb/Property.aspx?prop_id=312669'
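    ses = requests.Session()  # a Session keeps cookies between requests
    ses.get(url1)             # load the search page first to receive the cookies
    soup = BeautifulSoup(ses.get(url).content)  # same session, so the cookies are sent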
    print soup.prettify()
    

    Note that requests is not part of the standard library; you'll have to install it. If you want to use urllib2 instead:

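    import urllib2
    from cookielib import CookieJar
    from BeautifulSoup import BeautifulSoup
    
    url1 = 'http://propaccess.traviscad.org/clientdb/?cid=1'
    url = 'http://propaccess.traviscad.org/clientdb/Property.aspx?prop_id=312669'
    cj = CookieJar()
    # The opener stores cookies in cj and sends them back on later requests
    opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(cj))
    opener.open(url1)  # load the search page first to receive the cookies
    soup = BeautifulSoup(opener.open(url).read())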
    print soup.prettify()
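
If you want to see the mechanism without hitting the real site, here is a minimal sketch using only the standard library. It is written for Python 3, where urllib2 and cookielib became urllib.request and http.cookiejar. The local server, its paths and the cookie name `sid` are hypothetical stand-ins, not what traviscad.org actually uses; `fetch_with_session` and `run_demo` are likewise names made up for this example.

```python
import threading
import urllib.request
from http.cookiejar import CookieJar
from http.server import BaseHTTPRequestHandler, HTTPServer


class DemoHandler(BaseHTTPRequestHandler):
    """Hypothetical stand-in site: '/' sets a cookie, any other path requires it."""

    def do_GET(self):
        self.send_response(200)
        if self.path == "/":
            body = b"landing"
            self.send_header("Set-Cookie", "sid=abc123; Path=/")
        elif "sid=abc123" in self.headers.get("Cookie", ""):
            body = b"property details"
        else:
            body = b"Please try again"
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, *args):
        pass  # keep the demo quiet


def fetch_with_session(landing_url, target_url):
    """Open landing_url first so its cookies are sent along with target_url."""
    jar = CookieJar()
    opener = urllib.request.build_opener(urllib.request.HTTPCookieProcessor(jar))
    opener.open(landing_url).read()
    return opener.open(target_url).read(), jar


def run_demo():
    # Port 0 lets the OS pick a free port for the throwaway server
    server = HTTPServer(("127.0.0.1", 0), DemoHandler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    base = "http://127.0.0.1:%d" % server.server_port
    try:
        # A plain urlopen carries no cookies, like the code in the question
        without = urllib.request.urlopen(base + "/property").read()
        with_session, jar = fetch_with_session(base + "/", base + "/property")
        return without, with_session, [c.name for c in jar]
    finally:
        server.shutdown()
        server.server_close()


if __name__ == "__main__":
    print(run_demo())
```

The first request in `run_demo` gets the retry page because it carries no cookie, while the second gets the real content because the opener replays the cookie it received from the landing page. Against the real site you would pass the two traviscad URLs from above to `fetch_with_session` and hand the returned bytes to BeautifulSoup.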