python, web-crawler, worker, django-rq

How to get notified when multiple workers complete a job with an unknown size?


I'm trying to create a web site crawler in Python using django-rq. So far my worker looks like this:

  1. Get the next page from the queue.
  2. Create a page record in the database. Set status=1.
  3. Download the page contents and process. Might take up to a minute or so.
  4. For each link in the page
    1. Check if the link is already registered in the database.
    2. If not, create a new page record. Set status=0 and add the link to the queue.
  5. After the for loop ends, check whether the count of pages with status=0 is 0. If yes, the job is done.

status=1 means the page is processed. status=0 means the page is not processed yet.

Now, this algorithm works just fine with a single worker. However, it does not work correctly when there are more workers, because the end-of-job routine is sometimes triggered earlier than it should be.

What is the right way to implement this worker?


Solution

  • 1. Tweak your current system

    • If you hold off setting a page's status to 1 until after you have finished processing it, then your workers shouldn't declare 'job done' early.
    • Your step 2 is only necessary for the very first page that you start crawling with.

    So your system would look like this:

    start job:
    1. Create a page record in the database. Set status=0. Add page to queue.
    
    worker:
    1. Get the next page from the queue.
    2. Download the page contents and process. Might take up to a minute or so.
    3. For each link in the page
        1. Check if the link is already registered in the database.
        2. If not, create a new page record. Set status=0 and add the link to the queue.
    4. After the for loop ends, set status=1 for this page.
    5. Check whether the count of pages with status=0 is 0. If yes, the job is done.
    

    There is the problem that if a subsequent web crawling job is started before the previous one has finished, you will only get 'job done' at the end of the last one. You could perhaps add a job-id to your database page record and redefine 'job done' as something like count(status=0 and job-id=x) = 0, as in the sketch below.
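
    As a rough sketch, that tweaked worker might look like this (the Page model, the job_id argument and the helper functions are assumptions for illustration, not code from the question):

    import django_rq

    from myapp.models import Page   # assumed model with url, status and job_id fields

    queue = django_rq.get_queue()

    def process_page(url, job_id):
        # download and process the page; placeholder helpers as in option 2 below
        html = download_web_page(url)

        # register any new links and enqueue them for this same crawl job
        for link in find_links_to_follow(html):
            page, created = Page.objects.get_or_create(
                url=link, defaults={'status': 0, 'job_id': job_id})
            if created:
                queue.enqueue(process_page, link, job_id)

        # only now mark this page as processed
        Page.objects.filter(url=url).update(status=1)

        # 'job done' check scoped to this crawl job
        if Page.objects.filter(job_id=job_id, status=0).count() == 0:
            print('Job %s done!' % job_id)   # or notify however you like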

  • 2. Utilise RQ's job class

    From the RQ docs:

    When jobs get enqueued, the queue.enqueue() method returns a Job instance. ... it has a convenience result accessor property, that will return None when the job is not yet finished, or a non-None value when the job has finished (assuming the job has a return value in the first place, of course).

    You could queue two different types of job: one to fetch a web page and another to manage the crawl process.

    The management job would initiate and keep track of all the 'fetch web page' jobs. It would know the job is done when all of its sub-jobs have completed.

    You wouldn't necessarily need to write anything to a database to manage the crawl process.

    You would need to run at least two workers so that the crawl and fetch jobs can be worked on at the same time, perhaps on separate queues (see the note after the code below).

    import time

    import django_rq

    # One way to get hold of a queue; since the question uses django-rq, this
    # grabs the default queue (an assumption - the original answer left the
    # queue setup out).
    queue = django_rq.get_queue()

    def something_web_facing():
        ...
        queue.enqueue(crawl, 'http://url.com/start_point.html')
        ...

    def crawl(start_url):
        fetch_jobs = []
        seen_urls = set()

        seen_urls.add(start_url)
        fetch_jobs.append(queue.enqueue(fetch, start_url))

        while len(fetch_jobs) > 0:

            # loop over a copy of fetch_jobs so we can remove items as we go
            for job in list(fetch_jobs):

                # has this job completed yet? result is None until the job
                # finishes, so compare against None rather than truthiness
                # (a fetch that returns an empty list has still finished)
                if job.result is not None:

                    # a fetch job returns a list of the next urls to crawl
                    for url in job.result:

                        # fetch this url if we haven't seen it before
                        if url not in seen_urls:
                            seen_urls.add(url)
                            fetch_jobs.append(queue.enqueue(fetch, url))

                    fetch_jobs.remove(job)

            time.sleep(1)

        return "Job done!"

    def fetch(url):
        """Get web page from url, return a list of links to follow next."""
        # download_web_page and find_links_to_follow are placeholder helpers
        # you would implement yourself (e.g. with requests and BeautifulSoup)
        html_page = download_web_page(url)
        links_to_follow = find_links_to_follow(html_page)
        return links_to_follow
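
    If you do split crawl and fetch across separate queues, one way to wire that up with django-rq is sketched below; the queue names 'crawl' and 'fetch' are assumptions and would need matching entries in your RQ_QUEUES setting.

    import django_rq

    # assumed queue names; each needs an entry in settings.RQ_QUEUES
    crawl_queue = django_rq.get_queue('crawl')
    fetch_queue = django_rq.get_queue('fetch')

    def start_crawl(start_url):
        # the long-running management job gets its own queue...
        crawl_queue.enqueue(crawl, start_url)

    # ...and inside crawl(), fetch jobs would go onto the other queue:
    # fetch_queue.enqueue(fetch, url) instead of queue.enqueue(fetch, url).

    # Then run at least one worker per queue, e.g.:
    #   python manage.py rqworker crawl
    #   python manage.py rqworker fetch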
    

  • 3. Use someone else's web crawler code

    Scrapy

    You could queue up a job that uses Scrapy, following the Scrapy docs on running Scrapy from a script.
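
    As a minimal sketch of that idea (the spider below is a made-up example, not part of the original answer), an RQ job could run Scrapy from a script like this:

    import scrapy
    from scrapy.crawler import CrawlerProcess

    class SiteSpider(scrapy.Spider):
        """Hypothetical spider, just to show the shape of the job."""
        name = 'site'

        def __init__(self, start_url=None, *args, **kwargs):
            super().__init__(*args, **kwargs)
            self.start_urls = [start_url]

        def parse(self, response):
            # record the page, then follow its links
            yield {'url': response.url}
            for href in response.css('a::attr(href)').getall():
                yield response.follow(href, callback=self.parse)

    def crawl_with_scrapy(start_url):
        """Enqueue this with queue.enqueue(crawl_with_scrapy, url)."""
        process = CrawlerProcess(settings={'LOG_ENABLED': False})
        process.crawl(SiteSpider, start_url=start_url)
        process.start()   # blocks until the crawl has finished
        return "Job done!"

    Note that process.start() runs the Twisted reactor, which cannot be restarted in the same process; RQ's default worker forks a fresh process for each job, so in practice each enqueued crawl gets its own reactor.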