I am trying to make a link crawler in Python; I know about HarvestMan, but that's not what I am looking for. Here is what I have so far:
import httplib, sys

target = sys.argv[1]
subsite = sys.argv[2]
link = "http://" + target + subsite

def spider():
    while 1:
        conn = httplib.HTTPConnection(target)
        conn.request("GET", subsite)
        r2 = conn.getresponse()
        data = r2.read().split('\n')
        for x in data[:]:
            if link in x:
                print x

spider()
But I can't seem to find a way to filter x so that I can retrieve the links. I think something like this would work:

import re
re.findall("href=([^ >]+)", x)
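For what it's worth, here is a small self-contained sketch of that idea (written for Python 3, unlike the httplib code above, and using a made-up HTML snippet as input). It tightens the pattern to capture only the quoted value of each href attribute, since `[^ >]+` would keep the surrounding quote characters:

```python
import re

# Hypothetical HTML input, purely for illustration.
html = '<a href="http://example.com/a">A</a>\n<p>no link here</p>\n<a href="/relative/path">B</a>'

# Capture the quoted value of each href attribute. This handles both
# single- and double-quoted attributes, but like any regex approach it
# is only a rough filter, not a real HTML parser.
links = re.findall(r'href=["\']([^"\']+)["\']', html)
print(links)  # ['http://example.com/a', '/relative/path']
```

A regex is fine for a quick crawler like this, but for anything beyond that a real parser (e.g. the standard library's html.parser) copes better with unquoted attributes, odd spacing, and hrefs split across lines.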