I am currently working on a spider, but I need to be able to call the spider() function more than once to follow links. Here is my code:
    import httplib, sys, re

    def spider(target, link):
        try:
            conn = httplib.HTTPConnection(target)
            conn.request("GET", "/")
            r2 = conn.getresponse()
            data = r2.read().split('\n')
            for x in data[:]:
                if link in x:
                    a = ''.join(re.findall("href=([^ >]+)", x))
                    a = a.translate(None, '''"'"''')
                    if a:
                        return a
        except:
            exit(0)

    print spider("www.yahoo.com", "http://www.yahoo.com")
but I only get one link in the output. How can I make it return all the links?
Also, how can I get the subsites from those links so the spider can follow them?
This is probably closer to what you're looking for:
    import httplib, sys, re

    def spider(link, depth=0):
        # Give up past a couple of levels so the crawl can't run forever
        if depth > 2:
            return []
        try:
            # httplib.HTTPConnection wants a bare host, not a full URL
            host = link.replace("http://", "").split("/")[0]
            conn = httplib.HTTPConnection(host)
            conn.request("GET", "/")
            r2 = conn.getresponse()
            data = r2.read().split('\n')
            links = []
            for x in data:
                # Collect any line that contains an href, not just ones
                # pointing back at the page we started from
                if "href" in x:
                    a = ''.join(re.findall("href=([^ >]+)", x))
                    a = a.translate(None, '"' + "'")
                    if a:
                        links.append(a)
            # Recurse into each link found on this page and keep its links too
            for found in links[:]:
                links += spider(found, depth + 1)
            return links
        except:
            exit(1)

    print spider("http://www.yahoo.com")
It's untested, but the basics are there: scrape all the links on a page, then recursively crawl them. Each call returns the list of links found on that page, and when a page is crawled recursively, the links returned by that recursive call are appended to the list. The code also caps the recursion depth so the crawl doesn't go on forever.
It still has some obvious gaps, though, like cycle detection: two pages that link to each other will keep getting re-fetched until the depth limit kicks in.
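If you want to add that, one simple approach is to thread a set of already-visited URLs through the recursion. A rough, still-untested sketch along the same lines as the code above:

    import httplib, re

    def spider(link, depth=0, visited=None):
        # One set shared across the whole crawl, so each URL is fetched at most once
        if visited is None:
            visited = set()
        if depth > 2 or link in visited:
            return []
        visited.add(link)
        try:
            # httplib wants a bare host name, not a full URL
            host = link.replace("http://", "").split("/")[0]
            conn = httplib.HTTPConnection(host)
            conn.request("GET", "/")
            data = conn.getresponse().read().split('\n')
        except Exception:
            return []
        links = []
        for x in data:
            if "href" in x:
                a = ''.join(re.findall("href=([^ >]+)", x)).translate(None, '"' + "'")
                if a:
                    links.append(a)
        # Crawl what we found; the visited set stops pages that link to each
        # other from being fetched over and over
        for found in links[:]:
            links += spider(found, depth + 1, visited)
        return links

    print spider("http://www.yahoo.com")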
A few side notes: there are better ways to do some of this stuff.
For example, urllib2 can fetch web pages for you a lot more easily than httplib.
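Fetching a page becomes a one-liner (a minimal sketch, no error handling):

    import urllib2

    # urllib2 takes a full URL and handles the connection details for you
    html = urllib2.urlopen("http://www.yahoo.com").read()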
And BeautifulSoup extracts links from web pages far more robustly than your regex + translate kludge.
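Something along these lines (an untested sketch; I'm assuming BeautifulSoup 4, imported from bs4):

    import urllib2
    from bs4 import BeautifulSoup

    html = urllib2.urlopen("http://www.yahoo.com").read()
    soup = BeautifulSoup(html, "html.parser")
    # Pull the href attribute out of every <a> tag that has one
    links = [a["href"] for a in soup.find_all("a", href=True)]
    print links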