python, mechanize, mechanize-python

Scraping a website that presents a choice after login before retrieving the requested page


I am trying to scrape a website with a strange behaviour. I point mechanize at the URL of the page I want to retrieve; as usual, the site shows me the login page and I submit the form fields. But after I submit the form, instead of the page I asked for, the site presents a page with a choice of two links to pick my profile, and only after clicking the chosen profile can I access the page I want. With mechanize I cannot manage to click that link and retrieve the page I want to read. This is my code:

from bs4 import BeautifulSoup as bs
import urllib3
import mechanize
import cookielib

# keep cookies across requests so the login session is preserved
cj = cookielib.CookieJar()
br = mechanize.Browser()
br.set_handle_robots(False)
br.set_cookiejar(cj)

# open the page I actually want; the site answers with the login page
br.open("the_url_I_want_scrape")

# fill in and submit the login form (the third form on the page)
br.select_form(nr=2)
br.form.set_all_readonly(False)
br.form['username'] = "my_user"
br.form["password"] = "my_pass"
br.form["button.submit"] = "entra"
br.submit()
html = br.response().read()

Now if I iterate over the links, I get two objects:

for link in br.links():
    print link

They look like the following lines:

Link(base_url='https://www.sito.com/internal/login', url='/internal/sessionProperty?sessid=1111', text='Profile1', tag='a', attrs=[('href', '/internal/sessionProperty?sessid=1111')])
Link(base_url='https://www.sito.com/internal/login', url='/internal/sessionProperty?sessid=3333', text='Profile2', tag='a', attrs=[('href', '/internal/sessionProperty?sessid=3333')])

How can I simulate a click on one of them and then parse the resulting page? I have tried adding absolute_url to the link and then using follow_link, but it hangs and stops responding. The code I use is:

for link in br.links():
  link.absolute_url = mechanize.urljoin(link.base_url,link.url)
  br.follow_link(link)

Can someone help me? Thank you, Alex


Solution

  • I had a similar experience when I needed to scrape a website that made heavy use of JavaScript (like hidden menus) and had to use Selenium to simulate browser behaviour instead of mechanize. You could try that; a sketch is given below.

    You could also track the POST request as stated in this question and try to simulate it; a second sketch of that approach follows.
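
    A minimal sketch of the Selenium approach, assuming a recent Selenium with a working Firefox/geckodriver setup; the field names (username, password, button.submit) and the link text 'Profile1' are taken from the question and may need to be adapted:

from selenium import webdriver
from selenium.webdriver.common.by import By
from bs4 import BeautifulSoup as bs

driver = webdriver.Firefox()
driver.get("the_url_I_want_scrape")   # the site answers with the login page

# fill in and submit the login form (names taken from the mechanize code above)
driver.find_element(By.NAME, "username").send_keys("my_user")
driver.find_element(By.NAME, "password").send_keys("my_pass")
driver.find_element(By.NAME, "button.submit").click()

# the profile-choice page is shown next; click the wanted profile by its link text
driver.find_element(By.LINK_TEXT, "Profile1").click()

# the page you were after is now loaded; hand its HTML to BeautifulSoup
soup = bs(driver.page_source, "html.parser")
driver.quit()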
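
    If the profile "click" is really just a GET to the link's href (as the output in the question suggests) rather than something driven by JavaScript, replaying that request inside the same mechanize session may be enough. A minimal sketch, picking the link by the text 'Profile1' shown in the question and continuing right after br.submit() in the original code:

# ... after br.submit() from the login code in the question ...
# pick one profile link by its text instead of looping over all links
profile = br.find_link(text='Profile1')

# replay the GET that the click would perform, in the same (logged-in) session
br.open(mechanize.urljoin(profile.base_url, profile.url))
html = br.response().read()
soup = bs(html, "html.parser")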