python web-crawler httplib

Filtering a variable so it only contains a specified string in Python


I am trying to make a link crawler in Python. I know about HarvestMan, but that's not what I am looking for. Here is what I have so far:

import httplib, sys

target=sys.argv[1]
subsite=sys.argv[2]
link = "http://"+target+subsite

def spider():
    while 1:
        conn = httplib.HTTPConnection(target)
        conn.request("GET", subsite)
        r2 = conn.getresponse()
        data = r2.read().split('\n')
        for x in data[:]:
            if link in x:
                print x
spider()

But I can't seem to find a way to filter x so that I retrieve only the links.


Solution

  • I think this would work:

    import re
    re.findall("href=([^ >]+)",x)
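Note that the pattern above also captures any quotes around the URL, and it does not check that the URL belongs to the target site. A minimal sketch of one way to handle both follows. It is written for Python 3 (where httplib was renamed http.client) and runs on a hard-coded sample string rather than a live response. The names HREF_RE, extract_links, prefix, and sample are made up for illustration, and the example.com URLs are placeholders. A regex like this only copes with simple, quoted href attributes; it is not a full HTML parser.

```python
import re

# Assumption: href values are wrapped in single or double quotes.
# The capture group holds only the URL, without the quotes.
HREF_RE = re.compile(r'href\s*=\s*["\']([^"\']+)["\']', re.IGNORECASE)

def extract_links(html, prefix=None):
    """Return all href values found in html.

    If prefix is given, keep only the URLs that start with it
    (for example the "http://" + target + subsite string from the question).
    """
    links = HREF_RE.findall(html)
    if prefix is not None:
        links = [url for url in links if url.startswith(prefix)]
    return links

# Hypothetical page content standing in for r2.read()
sample = (
    '<a href="http://example.com/docs/a.html">A</a>\n'
    '<a href=\'http://example.com/docs/b.html\'>B</a> '
    '<a href="http://other.org/">C</a>\n'
)

for url in extract_links(sample, prefix="http://example.com/docs"):
    print(url)
```

Running the pattern over the whole response body, instead of splitting it on '\n' first, also picks up lines that contain more than one link.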