Tags: python, web, html-parsing, beautifulsoup, downloading-website-files

Python - cannot access a specific div [Urllib, BeautifulSoup, maybe Mechanize?]


I have been banging my head against this wall for a couple of days now, so I thought I would ask the SO community. I want a Python script that, among other things, can hit 'accept' buttons on website forms in order to download files. To do that, though, I first need to get access to the form.

This is an example of the kind of file I want to download. I know that the page contains an unnamed form whose action accepts the terms and downloads the file. I also know that this form sits inside the main-content div.

However, whenever I parse the webpage with BeautifulSoup, I cannot get the main-content div. The closest I've managed to get is the main_content link right before it, and BeautifulSoup's object for that link gives me no useful information.

Here's a bit of code from my script:

web_soup = soup(urllib2.urlopen(url))
parsed = list(urlparse(url))
ext = extr[1:]
for downloadable in web_soup.findAll("a"):
  encode = unicodedata.normalize('NFKD',downloadable.text).encode('UTF-8','ignore')
  if ext in str.lower(encode):
    if downloadable['href'] in url:
      return ("http://%s%s" % (parsed[1],downloadable['href']))
for div in web_soup.findAll("div"):
  if div.has_key('class'):
    print(div['class'])
    if div['class'] == "main-content":
      print("Yep")
return False

url is the URL I am looking at (the one I linked earlier). extr is the type of file I am hoping to download, in the form .extension, but that is not really relevant to my question. The relevant code is the second for loop, where I attempt to loop through the divs. The first for loop grabs download links in a different case (when the URL the script is given is a 'download link' marked by a file extension such as .zip, but with a content type of text/html), so feel free to ignore it. I included it just for context.

I hope I provided enough detail, though I am sure I did not. Let me know if you need any more information on what I am doing and I will be happy to oblige. Thanks, Stack.


Solution

  • Here's code that gets the main-content div and the form action:

    import re
    import urllib2
    from bs4 import BeautifulSoup as soup
    
    
    url = "http://www.cms.gov/apps/ama/license.asp?file=/McrPartBDrugAvgSalesPrice/downloads/Apr-13-ASP-Pricing-file.zip"
    web_soup = soup(urllib2.urlopen(url))
    
    # get main-content div
    main_div = web_soup.find(name="div", attrs={'class': 'main-content'})
    print main_div
    
    # get form action
    form = web_soup.find(name="form", attrs={'action': re.compile(r'.*\.zip.*')})
    print form['action']
    

    If you need them, I can also provide examples using lxml, mechanize or selenium.
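
For the follow-up step of actually submitting the form, here is a rough sketch of how the pieces might fit together. It is written for Python 3 with bs4 rather than the Python 2 / urllib2 setup above, and it runs against a made-up HTML snippet instead of the live page, so SAMPLE_HTML, the example.com URL, the field names and the find_download_form helper are all assumptions for illustration, not the real page's markup. It also shows one thing worth knowing: in bs4, a tag's class attribute comes back as a list, so comparing it to a plain string with == will not match.

```python
import re
from urllib.parse import urljoin

from bs4 import BeautifulSoup

# Hypothetical stand-in for the license page; the real markup may differ.
SAMPLE_HTML = """
<html><body>
  <a name="main_content"></a>
  <div class="main-content wide">
    <form method="post" action="/apps/ama/license.asp?file=/downloads/example.zip">
      <input type="hidden" name="agree" value="yes">
      <input type="submit" name="accept" value="Accept">
    </form>
  </div>
</body></html>
"""


def find_download_form(html, page_url):
    """Return (absolute action URL, method, named input fields), or None."""
    soup = BeautifulSoup(html, "html.parser")
    # class_ matches any single class on the tag, so extra classes are fine.
    main_div = soup.find("div", class_="main-content")
    if main_div is None:
        return None
    # A compiled regex is matched with search(), so r"\.zip" is enough.
    form = main_div.find("form", action=re.compile(r"\.zip"))
    if form is None:
        return None
    # Collect every named input so it can be sent along with the request.
    fields = {
        field["name"]: field.get("value", "")
        for field in form.find_all("input")
        if field.has_attr("name")
    }
    # Resolve a relative action against the URL of the page it came from.
    action_url = urljoin(page_url, form["action"])
    return action_url, form.get("method", "get").lower(), fields


page_url = "http://www.example.com/apps/ama/license.asp"  # placeholder URL
print(find_download_form(SAMPLE_HTML, page_url))

# The class attribute is multi-valued in bs4, so this is a list, not a string.
main_div = BeautifulSoup(SAMPLE_HTML, "html.parser").find("div", class_="main-content")
print(main_div["class"])
```

The returned URL, method and fields could then be handed to whatever HTTP client you prefer (urllib, requests, mechanize) to perform the actual submit; that part is left out here because it depends on how the real form behaves.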

    Hope that helps.