Tags: python, web-crawler, httplib

How to return multiple values in Python


I am currently working on a spider, but I need to be able to call the spider() function more than once to follow links. Here is my code:

import httplib, sys, re

def spider(target, link):
    try:
        conn = httplib.HTTPConnection(target)
        conn.request("GET", "/")
        r2 = conn.getresponse()
        data = r2.read().split('\n')
        for x in data[:]:
            if link in x:
                a = ''.join(re.findall("href=([^ >]+)", x))
                a = a.translate(None, '''"'"''')
                if a:
                    return a
    except:
        exit(0)

print spider("www.yahoo.com", "http://www.yahoo.com")

but I only get one link in the output. How can I make it return all of the links?

Also, how can I get the sub-sites from the links so the spider can follow them?


Solution

  • This is probably closer to what you're looking for:

    import httplib, sys, re

    def spider(link, depth=0):
        # Stop recursing past a maximum depth so the crawl can't run forever
        if depth > 2:
            return []

        try:
            # httplib wants a bare host name, so strip the scheme and any path
            host = link.replace("http://", "").split("/")[0]
            conn = httplib.HTTPConnection(host)
            conn.request("GET", "/")
            r2 = conn.getresponse()
            data = r2.read().split('\n')
            links = []
            for x in data:
                if link in x:
                    a = ''.join(re.findall("href=([^ >]+)", x))
                    a = a.translate(None, '"' + "'")
                    if a:
                        links.append(a)

            # Recurse into each link found on this page; iterate over a copy so
            # the links appended by the recursive calls aren't re-crawled here
            for found in list(links):
                links += spider(found, depth + 1)

            return links

        except:
            exit(1)

    print spider("http://www.yahoo.com")
    

    It's untested, but the basics are there: scrape all the links on a page, then recursively crawl them. Each call returns the list of links found on that page, and when a page is crawled recursively, the links returned by that call are appended to the list. The code also enforces a maximum recursion depth so the crawl doesn't run forever.

    It's still missing some obvious things, like cycle detection (a rough sketch of one way to add it is below).
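
    For instance (this is just a sketch, not part of the answer's code), you could thread a set of already-visited links through the recursion and skip anything you have seen before; fetch_links below is a hypothetical stand-in for the scraping logic above:

    def spider(link, depth=0, visited=None):
        if visited is None:
            visited = set()
        # Skip pages already crawled, and stop once past the depth limit
        if depth > 2 or link in visited:
            return []
        visited.add(link)
        links = fetch_links(link)      # hypothetical helper: hrefs found on the page
        for found in list(links):      # iterate over a snapshot of the list
            links += spider(found, depth + 1, visited)
        return links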

    A few side notes: there are better ways to do some of this stuff.

    For example, urllib2 can fetch web pages for you a lot more easily than httplib.
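
    A minimal sketch (assuming Python 2, where urllib2 is in the standard library):

    import urllib2

    # urllib2 handles the connection and the URL parsing for you
    html = urllib2.urlopen("http://www.yahoo.com").read()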

    And BeautifulSoup extracts links from web pages far more reliably than your regex + translate kludge.
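
    Something along these lines (a sketch assuming BeautifulSoup 4 is installed; older versions use a different import):

    import urllib2
    from bs4 import BeautifulSoup

    html = urllib2.urlopen("http://www.yahoo.com").read()
    soup = BeautifulSoup(html)
    # Pull the href out of every anchor tag, no regex or translate() needed
    links = [a["href"] for a in soup.find_all("a", href=True)]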