I am using mechanize to log into a website and scrape it with BeautifulSoup. I got it working without functions, but I don't know how to move the login functionality into a function and then use it from the main program. Here is my code so far, which does not work:
#!/usr/bin/env python
import http.cookiejar as cookielib

import mechanize
from bs4 import BeautifulSoup


def set_browser():
    br = mechanize.Browser()
    cookiejar = cookielib.LWPCookieJar()
    br.set_cookiejar(cookiejar)
    br.set_handle_equiv(True)
    br.set_handle_gzip(True)
    br.set_handle_redirect(True)
    br.set_handle_referer(True)
    br.set_handle_robots(False)
    br.set_handle_refresh(mechanize._http.HTTPRefreshProcessor(), max_time=1)
    br.addheaders = [('User-agent', 'Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.9.0.1) Gecko/2008071615 Fedora/3.0.1-1.fc9 Firefox/3.0.1')]
    return br


def login(br):
    br.open("https://example.com/login/index.php")
    br.select_form(nr=0)
    br.form['username'] = "admin"
    br.form['password'] = "mypassword"
    br.submit()


def scrape():
    url = "https://example.com/content"
    data = br.open(url).get_data()
    soup = BeautifulSoup(data, 'html.parser')
    with open("source.html", "w") as text_file:
        print(soup.prettify(), file=text_file)


if __name__ == "__main__":
    set_browser()
    login(br)
    scrape()
I hope someone can help me write proper functions. In the code above I wrote two functions, set_browser() and login(), but having two is not important; combining them into one would be fine. I only split them up to practice using functions.
I think that when a function returns a value, you need to store it somewhere and then pass it to the next function, so it should look something like this:
def login(br):
    br.open("https://example.com/login/index.php")
    br.select_form(nr=0)
    br.form['username'] = "admin"
    br.form['password'] = "mypassword"
    br.submit()


if __name__ == "__main__":
    br = set_browser()
    login(br)
    scrape()
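
The "store the returned value, then pass it on" idea can be tried without any network access. The following is only a minimal sketch: FakeBrowser is a hypothetical stand-in for mechanize.Browser (it just records the URLs it is asked to open), and the logged_in attribute and the main() helper are assumptions added for illustration, not part of mechanize. The one deliberate difference from the snippet above is that scrape() takes the browser as a parameter too, so no function depends on a module-level name.

```python
class FakeBrowser:
    """Hypothetical stand-in for mechanize.Browser; records opened URLs."""

    def __init__(self):
        self.opened = []
        self.logged_in = False  # illustrative flag, not a mechanize attribute

    def open(self, url):
        self.opened.append(url)
        return "<html>content of {}</html>".format(url)


def set_browser():
    # Build the browser object and hand it back to the caller.
    br = FakeBrowser()
    return br


def login(br):
    # Works on the browser it is given instead of relying on a global name.
    br.open("https://example.com/login/index.php")
    br.logged_in = True


def scrape(br):
    # Takes the same browser as a parameter and returns the page source.
    return br.open("https://example.com/content")


def main():
    br = set_browser()  # store the returned value
    login(br)           # pass it to the next function
    page = scrape(br)   # and to the one after that
    return br, page


if __name__ == "__main__":
    browser, page = main()
    print(page)
```

With the real library, set_browser() would presumably return the configured mechanize.Browser() instead, and login() and scrape() would keep the same signatures.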