I'm trying to write a web scraper to automate some of the things I have to do at work. To use the web application on the site, I need to log in with basic authentication (I know the scheme is basic). In a web browser, I go to the URL, a dialog pops up asking for my username and password, which I give, and then I'm taken to a second login page (just in case that's relevant).
Here's what I do in Python, using urllib2:
theurl = 'http://where-i-work.com/the-backend'
user_agent = 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10.10; rv:42.0) Gecko/20100101 Firefox/42.0'
#nothing seems to work if the server knows I'm using Python
headers = {'User-Agent': user_agent,
           'Authorization': 'Basic myauthorizationstring12345'}
#I've seen this sent in my requests, but also double checked from encoding username and pw with base64
data = ''
req = urllib2.Request(theurl, data, headers)
handle = urllib2.urlopen(req)
When I try to create handle, I get a 405 error: Method Not Allowed. I've read that 99% of the time this means I tried to send POST values when I shouldn't have. But I've also seen this information sent in requests when I use Tamper Data. Just to see, I tried sending the login info in data instead of the headers (URL-encoded) and got a 401 error, as if I had never sent the credentials.
Again, just to experiment, I also tried https instead of http, which yielded a "certificate verify failed" error, which I understand to be a separate issue. Basically, I've been trying everything I can think of.
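For what it's worth, the value after "Basic" is just "username:password" base64-encoded, so you can double-check the string you're sending. A quick sketch (using the well-known placeholder credentials from RFC 2617, not real ones):

```python
import base64

# Basic auth credentials are "username:password", base64-encoded.
# 'Aladdin' / 'open sesame' is the canonical example pair from RFC 2617.
credentials = 'Aladdin:open sesame'
auth_header = 'Basic ' + base64.b64encode(credentials.encode('ascii')).decode('ascii')
print(auth_header)  # Basic QWxhZGRpbjpvcGVuIHNlc2FtZQ==
```

If the string a browser sends (visible in Tamper Data) doesn't match what this produces for your username and password, the credentials are the problem rather than the request method.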
I've also tried using urllib2.HTTPPasswordMgrWithDefaultRealm() and urllib2.HTTPBasicAuthHandler(), etc., and I still got the 405 error. For now I'm sticking with the code above because I want to see everything that is happening while I'm still trying to figure this out.
Do I just not understand how the credentials are being sent normally by a browser? Am I doing something different?
You're providing a non-None value (the empty string) for the data parameter, so your request is being sent as a POST. Just use keyword arguments and don't specify a value for data at all:
req = urllib2.Request(theurl, headers=headers)
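You can confirm this without hitting the server at all: the Request object decides between GET and POST purely by whether data is None. A sketch (the try/except import is only so the same snippet runs on Python 3, where urllib2 was folded into urllib.request):

```python
try:
    import urllib2 as urlreq          # Python 2
except ImportError:
    import urllib.request as urlreq   # Python 3 equivalent of urllib2

url = 'http://where-i-work.com/the-backend'

# No data argument: the request method defaults to GET.
print(urlreq.Request(url).get_method())               # GET

# Any non-None data, even an empty byte string, flips it to POST.
print(urlreq.Request(url, ''.encode()).get_method())  # POST
```

That empty-but-not-None data = '' in your code is exactly what triggered the 405: the server refuses POST on that URL.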
An alternative approach would be to use the requests module:
import requests

response = requests.get(
    url='http://where-i-work.com/the-backend',
    headers={'User-Agent': 'Mozilla'},
    auth=('username', 'password'),
)