As an introduction to BeautifulSoup and urllib2, I thought I would write a basic scraping program that gets information about a given player from a video game website called lolking.net. Each user has a scrambled URL that does not include their username, so I have to scrape the player's URL extension from the site's HTML in order to reach their user page.
Here is an example string which I might encounter:
<div class="search_result_item" onclick="window.location='/summoner/na/26670961'; return false;"><div style="display: table-cell; text-align: center; padding: 10px 10px 16px;"><div style="font-size: 14px; display: block;">
I need to extract the digits that come after the /summoner/na/ part. How would I do that?
Let's demonstrate with Google since I don't know the particulars of the site in question (and the normal workflow would start with the whole page).
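import urllib2
from bs4 import BeautifulSoup
# Download the page, then parse it; naming the parser avoids bs4's "no parser specified" warning
html = urllib2.urlopen( "http://www.google.com" ).read()
soup = BeautifulSoup( html, "html.parser" )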
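A natural way for you to proceed is to find the tags you care about and then read the attribute that holds the data.
On Google, let's find all links (a tags) with the CSS class "gb1" and print their href attribute. The analogy to your case should be fairly straightforward: there you would look for div tags with the class "search_result_item" and read their onclick attribute.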
for tag in soup.find_all( "a", { "class" : "gb1" } ):
    print tag["href"]
This example is deliberately simple. It does not show that each "tag" object, much like the "soup" object, has its own "find_all" method (and other similar methods), so you can search inside a tag when you need to make more layers of nesting explicit. There are also ways to match other than find_all() by tag and class. Refer to the BeautifulSoup documentation to see exactly what is possible.
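Once you have the onclick attribute as a string, the number still has to be pulled out of it. Here is a minimal sketch using a regular expression. The pattern is an assumption based only on the one example string in the question (a region code followed by digits), and the function name extract_summoner_id is made up for illustration.

```python
import re

# Assumed shape, taken from the single example in the question:
#   window.location='/summoner/<region>/<digits>'; return false;
SUMMONER_PATH = re.compile(r"/summoner/(\w+)/(\d+)")

def extract_summoner_id(onclick):
    """Return the numeric ID found in an onclick string, or None if there is none."""
    match = SUMMONER_PATH.search(onclick)
    return match.group(2) if match else None

# The onclick value copied from the example div in the question
onclick = "window.location='/summoner/na/26670961'; return false;"
print(extract_summoner_id(onclick))  # prints 26670961
```

With a parsed page, you would call this on tag["onclick"] for each tag returned by soup.find_all( "div", { "class" : "search_result_item" } ). The ID comes back as a string, which is convenient for building the player's URL; a None result means the attribute did not match the assumed pattern.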