To practise some more Python, I've been having a go at the challenges on pythonchallenge.com.
In brief, the first step of this challenge is to load an HTML page from a URL that ends in a number. The page contains a single line of text with a number in it. That number replaces the existing one in the URL and so takes you to the next page in the sequence. Apparently this continues for some time... (there is more to this challenge, but getting that part working is the first step).
My code for doing so is below (limited to running through what should be the first four pages in the sequence, for the time being). For some reason it works the first time - it gets the second page in the sequence, reads the number, goes to the third, and reads the number there. But then it gets stuck on the third. I don't understand why, though think it might be something to do with my attempt to turn the number into a string before putting it on the end of the URL. To answer the obvious question, yes I know that pythonchallenge is working OK - you can do the url-numbers thing manually for as long as you have the patience, to confirm, if you like :p
import httplib2
import re

counter = 0
new = '12345'  # the number for the initial page in the sequence, as a string
while True:
    counter = counter + 1
    if counter == 5:
        break
    original = 'http://www.pythonchallenge.com/pc/def/linkedlist.php?nothing='
    nextpage = original + new  # each page in the sequence is visited by adding
                               # the number after 'nothing='
    print(nextpage)
    h = httplib2.Http('.cache')
    response, content = h.request(nextpage, "GET")  # get the content of the page,
                                                    # which includes the number for the
                                                    # *next* page in the sequence
    p = re.compile(r'\d{4,5}$')  # regex to find a 4 to 5 digit number at the end of
                                 # the content
    new = str((p.findall(content)))  # make the regex result a string - is this
                                     # where the problem lies?
    print('cached?', response.fromcache)  # I was worried my requests were somehow
                                          # being cached not actually sent afresh to
                                          # pythonchallenge. But it seems they aren't.
    print(content)
    print(new)
The output of the above is shown below. It seems to work fine for the first run through (adding the 92512 to the URL, successfully getting the next page and finding the next value), but after that it just gets stuck and doesn't seem to load the following page in the sequence. Testing by changing the URL manually in a browser confirms that the number is correct and pythonchallenge is working OK.
It looks to me like something is going wrong turning my regex search into a string to add onto the end of the URL - but why it should work the first time and not the second I don't know. I was also concerned maybe my requests were only getting as far as a cache (I'm new to httplib2 and not confident about how it does caching) but they seem not to be. I also added a no-cache argument to the request just to be sure (not shown in this code) but it didn't help.
http://www.pythonchallenge.com/pc/def/linkedlist.php?nothing=12345
('cached?', False)
and the next nothing is 92512
['92512']
http://www.pythonchallenge.com/pc/def/linkedlist.php?nothing=['92512']
('cached?', False)
and the next nothing is 72758
['72758']
http://www.pythonchallenge.com/pc/def/linkedlist.php?nothing=['72758']
('cached?', False)
and the next nothing is 72758
['72758']
http://www.pythonchallenge.com/pc/def/linkedlist.php?nothing=['72758']
('cached?', False)
and the next nothing is 72758
['72758']
I would be grateful to anyone who can point out where I am going wrong, as well as for any relevant tips.
Thanks in advance...
http://www.pythonchallenge.com/pc/def/linkedlist.php?nothing=['72758']
                                                             ^^     ^^
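I think the problem is here. str() is being applied to the return value of findall(), which is a list, so the list's brackets and quotes end up in the URL: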
re.findall(pattern, string[, flags])
Return all non-overlapping matches of pattern in string, as a list of strings. The string is scanned left-to-right, and matches are returned in the order found. If one or more groups are present in the pattern, return a list of groups; this will be a list of tuples if the pattern has more than one group. Empty matches are included in the result unless they touch the beginning of another match.
-- Python doc
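One possible fix, as a minimal sketch: take the matched text itself rather than converting the whole list to a string. The helper name `next_number` and the constant `NUMBER_AT_END` are made up for illustration; the pattern is the one from the question. It uses `search()` instead of `findall()` because only one match is expected, though indexing the list with `[0]` would work as well.

```python
import re

# Same pattern as in the question: a 4- or 5-digit number at the end of the text.
NUMBER_AT_END = re.compile(r'\d{4,5}$')


def next_number(content):
    """Return the trailing number as a plain string, or None if there is none."""
    match = NUMBER_AT_END.search(content)
    return match.group() if match else None


page = 'and the next nothing is 92512'

# str() on the list returned by findall() keeps the brackets and quotes.
print(str(NUMBER_AT_END.findall(page)))  # ['92512']

# Taking the match itself gives only the digits, which can go on the end of the URL.
print(next_number(page))                 # 92512
```

In the loop from the question this would replace the `new = str((p.findall(content)))` line with `new = next_number(content)`, plus a check that stops the loop when the result is `None`. This assumes `content` is a text string, as it appears to be in the output shown; if it arrives as bytes it would need decoding first.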