I am trying to scrape some data from a website that assigns a session cookie and generates HTML that contains a crumb code that I need to append to a URL to get to the data. I run into problems (HTTP 401 Unauthorized) when the crumb variable contains a backslash... Since crumb is a variable, I could not figure out how to add r' to the beginning. I have tried adding .encode('string-escape') and .replace('\\','\\\\') to the crumb variable, but I cannot get it to work.
My code, in Python 2.7, looks something like this:
import cookielib
import urllib2
cj = cookielib.CookieJar()
opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(cj))
opener.open('http://www.sample.com')
#Some code here that looks for crumb code in HTML
crumb = 'abc\\xyz'  # example value with one backslash; the real crumb is parsed from the HTML
#This line fails when crumb contains a backslash
opener.open('http://www.sample.com/data=' + crumb)
cj.clear()
Does anyone know how I can avoid the 401 error when trying to open a URL string that contains a backslash?
Also, is it necessary to clear the session cookies each time if I'm looping through multiple crumbs?
Update: It turns out that the backslashes are being brought in from the \u002F in the HTML. I believe it'll work if I convert these to a forward slash before adding the string to the URL. How can I convert the \u002F in a string to a /?
Since you cannot use crumb = r'abc\xyz', I believe the str.encode('string-escape') function might help. Try:
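crumb = 'abc\\xyz'  # one literal backslash; a bare '\x' followed by non-hex characters is an invalid escape
crumb = crumb.encode('string-escape')  # encode() returns a new string, so keep the result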
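For the \u002F case described in the update, here is a minimal sketch of one way to decode the escape before building the URL. It assumes the crumb was copied out of a JSON or JavaScript string literal in the page, so that json.loads can interpret the \uXXXX sequences; decode_crumb and build_url are hypothetical helper names, not part of any library. If \u002F is the only escape that ever appears, crumb.replace('\\u002F', '/') would probably be enough.

```python
import json

try:
    from urllib import quote          # Python 2.7
except ImportError:
    from urllib.parse import quote    # Python 3


def decode_crumb(raw):
    """Turn escapes such as \\u002F into the characters they stand for.

    Assumption: raw is the body of a JSON/JavaScript string literal and
    contains no unescaped double quotes.
    """
    # Wrapping the text in quotes makes it a JSON string that json.loads can decode.
    return json.loads('"' + raw + '"')


def build_url(base, raw_crumb):
    """Append the decoded crumb to base (hypothetical helper)."""
    decoded = decode_crumb(raw_crumb)
    # quote() leaves '/' alone by default and percent-encodes characters
    # that are not safe to place in a URL.
    return base + quote(decoded.encode('utf-8'))


print(decode_crumb(r'abc\u002Fxyz'))  # abc/xyz
print(build_url('http://www.sample.com/data=', r'abc\u002Fxyz'))
```

The result of build_url can then be passed to opener.open(). Whether the server expects a literal / or the percent-encoded %2F is something to check against the site; passing safe='' to quote() would produce the latter.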