python, web-scraping, beautifulsoup, urllib2

Using BeautifulSoup and urllib2 in Python, how can I find the data surrounded by specific tags?


As an introduction to BeautifulSoup and urllib2, I thought I would write a basic scraping program that gets information about a given player from a video game website called lolking.net. Each user has a scrambled URL that does not include their username, so I have to scrape the player's URL path from the site's HTML before I can access their user page.

Here is an example string which I might encounter:

<div class="search_result_item" onclick="window.location='/summoner/na/26670961'; return false;"><div style="display: table-cell; text-align: center; padding: 10px 10px 16px;"><div style="font-size: 14px; display: block;">

I need to extract the number that follows the /summoner/na/ part. How would I do that?


Solution

  • Let's demonstrate with Google since I don't know the particulars of the site in question (and the normal workflow would start with the whole page).

    import urllib2
    from bs4 import BeautifulSoup

    # Fetch the page and parse it; naming the parser explicitly avoids a bs4 warning
    html = urllib2.urlopen("http://www.google.com").read()
    soup = BeautifulSoup(html, "html.parser")
    

    A natural way for you to proceed is:

    • find all divs with the CSS class "search_result_item"
    • take the onclick attribute of each one
    • match a regex against the JavaScript code in this attribute (I won't do this part here)
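The regex step in the last bullet could look roughly like the sketch below. It assumes the onclick value always contains a path of the form /summoner/<region>/<digits>, as in the example string from the question; the function name extract_summoner_id and the pattern are illustrative choices, not something the site defines.

```python
import re

# Assumption: the onclick value contains a path of the form
# /summoner/<region>/<digits>, as in the example from the question.
SUMMONER_PATH = re.compile(r"/summoner/(\w+)/(\d+)")


def extract_summoner_id(onclick):
    """Return the numeric ID as a string, or None if the pattern is absent."""
    match = SUMMONER_PATH.search(onclick)
    return match.group(2) if match else None


# The onclick value copied from the example div in the question
onclick = "window.location='/summoner/na/26670961'; return false;"
summoner_id = extract_summoner_id(onclick)
print(summoner_id)  # 26670961
```

Capturing the region as well (group 1) costs nothing and would let the same pattern handle regions other than "na", if the site uses them.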

    On Google, let's find all links (<a> tags) with the CSS class "gb1" and read their href attribute. The analogy to your case should be fairly straightforward.

    # Each match is a Tag; its attributes are read like dictionary keys
    for tag in soup.find_all("a", {"class": "gb1"}):
        print(tag["href"])
    

    This example is a little too simple: it does not show that each tag object, much like the soup object, has its own find_all method (and other similar methods). So if you need to walk through several levels of nesting explicitly, you can call find_all on a tag to search only within it. find_all() can also match on more than tag name and class. Refer to the BeautifulSoup documentation to see exactly what is possible.
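As a rough illustration of the nested search, here is a sketch that runs on a small inline HTML string instead of a live page. The markup is modeled on the snippet in the question, but the closing tags and the inner text "ExamplePlayer" are made up for this example, and the real page may be structured differently.

```python
import re
from bs4 import BeautifulSoup

# Hypothetical markup modeled on the question's snippet; the closing tags
# and the "ExamplePlayer" text are invented for illustration.
html = """
<div class="search_result_item" onclick="window.location='/summoner/na/26670961'; return false;">
  <div style="display: table-cell;">
    <div style="font-size: 14px; display: block;">ExamplePlayer</div>
  </div>
</div>
"""

soup = BeautifulSoup(html, "html.parser")

results = []
for item in soup.find_all("div", class_="search_result_item"):
    # A Tag has find_all too, so this search is limited to the item's descendants
    inner_divs = item.find_all("div")
    name = inner_divs[-1].get_text(strip=True)
    # Pull the numeric ID out of the JavaScript in the onclick attribute
    match = re.search(r"/summoner/\w+/(\d+)", item.get("onclick", ""))
    if match:
        results.append((name, match.group(1)))

print(results)
```

Using item.get("onclick", "") rather than item["onclick"] means an item without the attribute is skipped instead of raising a KeyError.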