This is a fairly open-ended question. I need to go through a job site and check each posting for a job-description keyword and a skill requirement (that part is done). What I want to know is how to crawl the site, i.e. go from test.com to test.com/a and so on, rather than checking a single page.
Here is my code for searching within one page. I need to find all such pages on the site and collect their links. This is not homework; it is a side project.
import urllib2
import re

# Fetch the raw HTML of a single job posting
html_content = urllib2.urlopen('http://www.ziprecruiter.com/job/Systems-Engineer/b5452eab/?source=customer-cpc-indeed').read()

# Search for the job-description keyword and the skill keyword
matchDescription = re.findall('Bachelor', html_content)
matchSkill = re.findall('VMware', html_content)
print matchDescription
print matchSkill

# Report a hit only when both keywords occur on the page
if matchDescription and matchSkill:
    print 'My string is in the html'
else:
    print 'I did not find anything'
Consider using Scrapy or some other existing scraping framework. Otherwise, you need to extract the links yourself with lxml or another HTML parser, fetch each one with something built on urllib, and keep a couple of data structures (for example a queue of URLs still to visit and a set of URLs already seen) to store the input and output data.
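To make the manual route more concrete, here is a minimal sketch of such a crawler. It is only one possible arrangement, and it rests on a few assumptions: it targets Python 3 (so `urllib.request` stands in for `urllib2`), it uses the standard library's `html.parser` in place of lxml so that it runs without extra installs, and the names `LinkExtractor`, `fetch`, `crawl`, `fetch_page` and `max_pages` are made up for illustration.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin, urldefrag, urlparse
from urllib.request import urlopen


class LinkExtractor(HTMLParser):
    """Collects the href value of every <a> tag fed to the parser."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == 'a':
            for name, value in attrs:
                if name == 'href' and value:
                    self.links.append(value)


def fetch(url):
    """Download a page and return its HTML as text (needs network access)."""
    with urlopen(url) as response:
        return response.read().decode('utf-8', errors='replace')


def crawl(start_url, fetch_page=fetch, max_pages=50):
    """Breadth-first crawl restricted to the host of start_url.

    Returns a dict mapping each visited URL to its HTML.
    """
    host = urlparse(start_url).netloc
    queue = deque([start_url])  # URLs waiting to be fetched
    seen = {start_url}          # URLs already queued, to avoid loops
    pages = {}                  # output: URL -> HTML

    while queue and len(pages) < max_pages:
        url = queue.popleft()
        try:
            html = fetch_page(url)
        except Exception:
            continue  # skip pages that fail to download
        pages[url] = html

        parser = LinkExtractor()
        parser.feed(html)
        for href in parser.links:
            # Resolve relative links and drop #fragments
            link, _ = urldefrag(urljoin(url, href))
            if urlparse(link).netloc == host and link not in seen:
                seen.add(link)
                queue.append(link)
    return pages
```

Calling `crawl('http://test.com/')` would return the pages it reached on that host, and the `re.findall` checks from the question could then be run on each value of the returned dict. The `max_pages` cap and the same-host check are there only to keep the sketch from wandering off; a real crawl should probably also respect robots.txt and pause between requests.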