Python error when adding variable with backslash character to URL string


I am trying to scrape some data from a website that assigns a session cookie and generates HTML containing a crumb code, which I need to append to a URL to get to the data. I run into problems (HTTP 401 Unauthorized) when the crumb variable contains a backslash. Since crumb is a variable rather than a literal, I cannot put an r prefix in front of it to make it a raw string. I have tried applying .encode('string-escape') and .replace('\\','\\\\') to the crumb variable, but I cannot get it to work.

My code, in Python 2.7, looks something like this:

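import cookielib
import urllib2

# Opener that stores and resends the session cookie
cj = cookielib.CookieJar()
opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(cj))
opener.open('http://www.sample.com')

# Some code here that looks for crumb code in HTML

# Stand-in value with one literal backslash ('abc\xyz' is an invalid \x escape)
crumb = 'abc\\xyz'

# This line fails when crumb contains a backslash
opener.open('http://www.sample.com/data=' + crumb)

cj.clear()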

Does anyone know how I can avoid the 401 error when trying to open a URL string that contains a backslash?

Also, is it necessary to clear the session cookies each time if I'm looping through multiple crumbs?

Update: It turns out that the backslashes come from a literal \u002F escape sequence in the HTML. I believe it will work if I convert these to a forward slash before adding the string to the URL. How can I convert the \u002F in a string to a /?


Solution

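  • Since you cannot use an r prefix (as in crumb = r'abc\xyz') on a variable, I believe the str.encode('string-escape') function (Python 2 only) might help. It returns a new string rather than changing crumb in place, so assign the result. Try:

    crumb = 'abc\\xyz'  # one literal backslash
    crumb = crumb.encode('string-escape')  # the backslash is now doubled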
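For the case described in the question's update, where the backslash comes from a literal \u002F in the HTML, a minimal sketch follows. It assumes the crumb was copied out of a JSON/JavaScript string literal in the page, so it contains no unescaped double quotes. The helper name decode_crumb and the sample value are hypothetical. The question does not say whether the server expects the slash raw or percent-encoded, so the quote call is an assumption; drop it if the raw slash is what works.

```python
import json

try:
    from urllib import quote          # Python 2.7
except ImportError:
    from urllib.parse import quote    # Python 3


def decode_crumb(raw):
    # Wrap the scraped text in double quotes and let the JSON parser
    # resolve escapes such as \u002F into the characters they stand for.
    return json.loads('"' + raw + '"')


raw_crumb = r'abc\u002Fxyz'         # hypothetical value as it appears in the HTML
crumb = decode_crumb(raw_crumb)     # the escape is now a plain forward slash

# Percent-encode the crumb so the slash is not read as a path separator.
url = 'http://www.sample.com/data=' + quote(crumb, safe='')
```

If \u002F is the only escape that ever shows up, crumb.replace('\\u002F', '/') does the same job without the JSON parser.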