Search code examples
pythonscreen-scrapingbeautifulsoup

Scrape Site With Incremental IDs


I have a site in the following format:

http://www.domain.com/membership/member_zoom.php?value

value starts at 1000 and stops around 15,000

Here is a sample of the source:

<h1>Member Information</h1>


<h2>Company Name</h2>
<p>Address<br />
More Address<br />
City<br />
State<br />
Postal code<br />
</p>
<p><strong>Contact:</strong> Firstname Lastname, PH.D., P.ENG. - <a href="mailto:[email protected]">[email protected]</a><br /></p>
<a href="http://www.domain.com">www.domain.com</a><br />
<p><strong>Phone:</strong> (555)555-5555<br /></p>

So, I need to grab everything between Member Information and the last div tag then increment the ID value 1, repeat. BUT, there are a lot of dead IDs. My scraper just hammers the site, incrementing once then hitting it again. Is there an easier way? Perhaps some way to build a failsafe in?


Solution

  • There's no way to tell whether an id exists until you try to load it and see whether it does. You'll need to find a list of links, or scrape member IDs from another part of the site. If you can't do that, you'll just have to try each one.