EDIT (SOLVED): When I read the values in from my file, each one comes back with a newline character (\n) on the end, and this splits my request string at that point. I think it's to do with how I saved the values to the file in the first place. Many thanks.
I have the following code:
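A minimal sketch of the fix described above, assuming the search terms are stored one per line. The names `BASE_URL` and `build_url` are hypothetical helpers, and `io.StringIO` stands in for the real file so the snippet is self-contained. It is written to run on Python 3, but the same `strip()` call applies to the Python 2 code in the question.

```python
import io

# Placeholder host taken from the question; substitute the real one.
BASE_URL = 'http://www.myurl.com/'

def build_url(term):
    # strip() removes the trailing '\n' (and any other surrounding
    # whitespace) that reading a line from a file leaves on the value.
    return BASE_URL + term.strip()

# io.StringIO stands in for open("myfile", "r") in this sketch.
fake_file = io.StringIO(u"first\nsecond\n")

raw = next(fake_file)   # the line still ends with '\n' at this point
url = build_url(raw)    # the newline is gone before the URL is built

print(repr(raw))
print(url)
```

In the original loop, this would amount to replacing `mystring=next(f)` with `mystring=next(f).strip()`.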
results = 'http://www.myurl.com/'+str(mystring)
print str(results)
request = urllib2.Request(results)
request.add_header('User-Agent','Mozilla/4.0 (compatible; MSIE 5.5; Windows NT 5.0)')
opener = urllib2.build_opener()
text = opener.open(request).read()
This is in a loop. After the loop has run a few times, str(mystring) changes to give a different set of results. I can loop the script as many times as I like while keeping the value of str(mystring) constant, but every time I change the value of str(mystring) I get an error saying "no host given" when the code tries to build the opener:
opener = urllib2.build_opener()
Can anyone help please?
TIA,
Paul.
EDIT:
More code here.....
import sys
import string
import httplib
import urllib2
import re
import random
import time
def StripTags(text):
    finished = 0
    while not finished:
        finished = 1
        start = text.find("<")
        if start >= 0:
            stop = text[start:].find(">")
            if stop >= 0:
                text = text[:start] + text[start+stop+1:]
                finished = 0
    return text
mystring="test"
d={}
with open("myfile","r") as f:
    while True:
        page_counter=0
        print str(mystring)
        try:
            while page_counter <20:
                results = 'http://www.myurl.com/'+str(mystring)
                print str(results)
                request = urllib2.Request(results)
                request.add_header('User-Agent','Mozilla/4.0 (compatible; MSIE 5.5; Windows NT 5.0)')
                opener = urllib2.build_opener()
                text = opener.open(request).read()
                finds = (re.findall('([\w\.\-]+'+mystring+')',StripTags(text)))
                for find in finds:
                    d[find]=1
                uniq_emails=d.keys()
                page_counter = page_counter +1
                print "found this " +str(finds)
                random.seed()
                n = random.random()
                i = n * 5
                print "Pausing script for " + str(i) + " Seconds" + ""
                time.sleep(i)
            mystring=next(f)
        except IOError:
            print "No result found!"+""
In the while loop, you're setting results to something which is not a URL:
results = 'myurl+str(mystring)'
It should probably be results = myurl+str(mystring).
By the way, it appears there's no need for all the casting to string (str()) you do:
(expanded on request)
print str(foo): in such a case, str() is never necessary. Python will always print foo's string representation.
results = 'http://www.myurl.com/'+str(mystring): this is also unnecessary; mystring is already a string, so 'http://www.myurl.com/' + mystring would suffice.
print "Pausing script for " + str(i) + " Seconds": here you would get an error without str(), since you can't add a string and a number. However, print "foo", 1, "bar" does work, as do print "foo %i bar" % 1 and print "foo {0} bar".format(1).
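The last point can be checked with a short sketch. The value `delay` is a hypothetical stand-in for the random pause length `i` in the question, and the snippet uses only expressions that behave the same on Python 2 and Python 3. Note that `%i` truncates a float to an integer, so `%s` is the closer match when the pause is fractional.

```python
delay = 2.5  # hypothetical stand-in for the pause length i in the question

# Plain concatenation fails without str(): str + float raises TypeError.
try:
    "Pausing script for " + delay + " Seconds"
    concat_failed = False
except TypeError:
    concat_failed = True

# With an explicit str() the concatenation works.
concatenated = "Pausing script for " + str(delay) + " Seconds"

# The formatting operators convert the number themselves.
percent_style = "Pausing script for %s Seconds" % delay
format_style = "Pausing script for {0} Seconds".format(delay)

# %i formats as an integer, so the fractional part is dropped.
int_style = "Pausing script for %i Seconds" % delay

print(concatenated)
print(int_style)
```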