I've written a simple script in Python.
It parses the hyperlinks from a webpage, and then retrieves those links to parse some information. I have similar scripts running that re-use the write function without any problems, but for some reason this one fails and I can't figure out why.
General Curl init:
import StringIO
import pycurl

storage = StringIO.StringIO()
c = pycurl.Curl()
c.setopt(pycurl.USERAGENT, USER_AGENT)
c.setopt(pycurl.COOKIEFILE, "")
c.setopt(pycurl.POST, 0)
c.setopt(pycurl.FOLLOWLOCATION, 1)
#Similar scripts work this way, so why not this one?
c.setopt(c.WRITEFUNCTION, storage.write)
First call to retrieve the links:
URL = "http://whatever"
REFERER = URL
c.setopt(pycurl.URL, URL)
c.setopt(pycurl.REFERER, REFERER)
c.perform()
#Write page to file
content = storage.getvalue()
f = open("updates.html", "w")
f.write(content)
f.close()
... Here the magic happens and links are extracted ...
Now looping over these links:
for i, member in enumerate(urls):
    URL = urls[i]
    print "url:", URL
    c.setopt(pycurl.URL, URL)
    c.perform()
    #Write page to file
    #Still the data from previous!
    content = storage.getvalue()
    f = open("update.html", "w")
    f.write(content)
    f.close()
    #print content
... Gather some information ...
... Close objects etc ...
If you want to download urls to different files in sequence (no concurrent connections):
for i, url in enumerate(urls):
    c.setopt(pycurl.URL, url)
    with open("output%d.html" % i, "w") as f:
        c.setopt(c.WRITEDATA, f)  # c.setopt(c.WRITEFUNCTION, f.write) also works
        c.perform()
Note:
storage.getvalue() returns everything that has been written to storage since it was created. In your case, you should find the output from all of the URLs in it.

open(filename, "w") overwrites the file (the previous content is gone), i.e., update.html contains whatever is in content on the last iteration of the loop.
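If you'd rather keep each page in memory as your original script does, the other option is to give curl a fresh buffer before each perform(), so getvalue() only ever holds the current page. A minimal sketch, assuming the same urls list and the Python 2 StringIO/pycurl setup from the question:

import StringIO
import pycurl

c = pycurl.Curl()
c.setopt(pycurl.FOLLOWLOCATION, 1)

for i, url in enumerate(urls):
    # Fresh buffer per URL, so getvalue() holds only this response
    storage = StringIO.StringIO()
    c.setopt(c.WRITEFUNCTION, storage.write)
    c.setopt(pycurl.URL, url)
    c.perform()

    content = storage.getvalue()
    # "w" overwrites, so use a distinct filename per URL
    with open("output%d.html" % i, "w") as f:
        f.write(content)

c.close()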