I'm trying to create a web site crawler in Python using django-rq. So far my worker looks like this:
It takes the next page from the queue, marks its record in the database with status=1, downloads and processes the page contents, and for each link it finds that is not yet registered in the database it creates a new page record with status=0 and adds the link to the queue. status=1 means the page is processed; status=0 means the page is not processed yet. Once there are no pages left with status=0, the end-of-job routine runs.

Now, this algorithm works just fine with a single worker. However, it does not when there are more workers, because the end-of-job routine is sometimes triggered earlier than it should be.
What is the right way to implement this worker?
So your system would be like this (a rough sketch in code follows these lists):
start job:
1. Create a page record in the database. Set status=0. Add page to queue.
worker:
1. Get the next page from the queue.
2. Download the page contents and process. Might take up to a minute or so.
3. For each link in the page:
   1. Check if the link is already registered in the database.
   2. If not, create a new page record. Set status=0 and add the link to the queue.
4. After the for loop ends, set status=1 for this page.
5. Check whether the count of pages with status=0 is 0. If yes, the job is done.
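If it helps to see those steps as code, here is a minimal sketch of that worker using django-rq; the Page model, the extract_links() helper and the finish_job() routine are assumed names, not part of your project:

import django_rq
import requests

from myapp.models import Page         # assumed model with url and status fields


def start_job(start_url):
    # start job: create a page record with status=0 and add it to the queue
    page = Page.objects.create(url=start_url, status=0)
    django_rq.enqueue(process_page, page.id)


def process_page(page_id):
    # 1. the worker gets the next page off the queue (RQ hands us its id)
    page = Page.objects.get(id=page_id)

    # 2. download the page contents and process; may take a while
    html = requests.get(page.url).text

    # 3. register unseen links and queue them with status=0
    for link in extract_links(html):  # extract_links() is an assumed helper
        if not Page.objects.filter(url=link).exists():
            new_page = Page.objects.create(url=link, status=0)
            django_rq.enqueue(process_page, new_page.id)

    # 4. mark this page as processed
    page.status = 1
    page.save()

    # 5. if no unprocessed pages remain, the crawl is done
    if not Page.objects.filter(status=0).exists():
        finish_job()                  # assumed end-of-job routine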
There is a problem with this, though: if a subsequent web crawling job is started before the previous one has finished, you will only get 'job done' at the end of the last one.
You could perhaps add a job-id to your database page record and redefine 'job done' as something like count(status=0 and job-id=x) = 0.
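With the Django ORM that check could look something like this (the Page model and its job_id field are assumed):

def crawl_finished(job_id):
    # 'job done' only when this particular crawl has no unprocessed pages left
    return not Page.objects.filter(job_id=job_id, status=0).exists()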
From the RQ docs:
When jobs get enqueued, the queue.enqueue() method returns a Job instance. ... it has a convenience result accessor property, that will return None when the job is not yet finished, or a non-None value when the job has finished (assuming the job has a return value in the first place, of course).
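So the bare mechanics are something like this (assuming the fetch() function shown further down lives somewhere the worker can import it):

from redis import Redis
from rq import Queue

from myapp.jobs import fetch     # assumed home of the fetch() function below

queue = Queue(connection=Redis())
job = queue.enqueue(fetch, 'http://url.com/start_point.html')

job.result    # None while the worker is still busy
# ... some time later ...
job.result    # the list of links that fetch() returned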
You could queue two different types of job: one to fetch a web page and another to manage the crawl process.
The management job would initiate and keep track of all the 'fetch web page' jobs. It would know the job is done when all of its sub-jobs have completed.
You wouldn't necessarily need to write anything to a database to manage the crawl process.
You would need to run 2+ workers so that crawl and fetch can be worked on at the same time, maybe on separate queues.
import time

from redis import Redis
from rq import Queue

queue = Queue(connection=Redis())


def something_web_facing():
    # e.g. a Django view that kicks off a crawl
    ...
    queue.enqueue(crawl, 'http://url.com/start_point.html')
    ...


def crawl(start_url):
    fetch_jobs = []
    seen_urls = set()

    seen_urls.add(start_url)
    fetch_jobs.append(queue.enqueue(fetch, start_url))

    while len(fetch_jobs) > 0:
        # loop over a copy of fetch_jobs so we can remove items while iterating
        for job in list(fetch_jobs):
            # has this job completed yet? (result stays None until it has)
            if job.result is not None:
                # a fetch job returns a list of the next urls to crawl
                for url in job.result:
                    # fetch this url if we haven't seen it before
                    if url not in seen_urls:
                        seen_urls.add(url)
                        fetch_jobs.append(queue.enqueue(fetch, url))
                fetch_jobs.remove(job)
        time.sleep(1)

    return "Job done!"


def fetch(url):
    """Get web page from url, return a list of links to follow next"""
    html_page = download_web_page(url)
    links_to_follow = find_links_to_follow(html_page)
    return links_to_follow
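If you go with django-rq here, the separate queues and workers could be wired up roughly like this (the queue names and Redis settings are assumptions):

# settings.py: one RQ queue per job type, both on the same Redis
RQ_QUEUES = {
    'crawl': {'HOST': 'localhost', 'PORT': 6379, 'DB': 0},
    'fetch': {'HOST': 'localhost', 'PORT': 6379, 'DB': 0},
}

# wherever you enqueue: management jobs on 'crawl', page downloads on 'fetch'
import django_rq

crawl_queue = django_rq.get_queue('crawl')
fetch_queue = django_rq.get_queue('fetch')

crawl_queue.enqueue(crawl, 'http://url.com/start_point.html')

# run at least one worker per queue so a long-running crawl job
# can't block the fetch jobs it is waiting on, e.g.:
#   python manage.py rqworker crawl
#   python manage.py rqworker fetch

In that layout, crawl() would enqueue its fetch jobs on fetch_queue rather than on the single shared queue used in the sketch above.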
You could queue up a job that uses Scrapy; see Run Scrapy from a script in the Scrapy documentation.
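A rough sketch of what such a job function might look like, following the CrawlerProcess approach described there (SiteSpider, its start_url argument and the FEEDS output are made-up examples):

import django_rq
from scrapy.crawler import CrawlerProcess

from myproject.spiders import SiteSpider   # assumed: an existing Scrapy spider


def crawl_site(start_url):
    """RQ job: run a whole Scrapy crawl and return once it has finished."""
    process = CrawlerProcess(settings={
        'FEEDS': {'items.json': {'format': 'json'}},   # example output location
    })
    process.crawl(SiteSpider, start_url=start_url)
    process.start()    # blocks until the crawl is finished
    return "Job done!"


# enqueue it like any other job
django_rq.enqueue(crawl_site, 'http://url.com/start_point.html')

One caveat: CrawlerProcess starts a Twisted reactor, which cannot be restarted within the same process; RQ's default worker forks a fresh work-horse process for each job, which should sidestep that.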