RapGenius posted this article about how they checked all 170k URLs that pointed to them by parallelizing the scraping task across worker dynos on Heroku using the Ruby library Typhoeus.
I've been working on a project that involves scraping (getting the source) for 1.5 million URLs, and I've been trying to speed it up. Being more comfortable with Python, I've managed to whip up a scraper that parallelizes across my desktop using redis and python multiprocessing. Where I'm confused is how I would modify it to work on Heroku.
Here's how my program is designed right now:
1) An initializer script runs and stores all the URLs ahead of time in a Redis queue
2) A script, run_workers.py, then starts all the processes like so:
workers = []
q = get_redis_queue(name)
for i in xrange(num_workers):
    p = multiprocessing.Process(target=worker.scraper_worker, args=(i, q))
    p.start()
    workers.append(p)
for w in workers:
    w.join()
3) Workers, in worker.py, that do a scraping task like this:
def scraper_worker(worker_id, queue):
    # consumes URL from redis queue, visits using python requests
    # stores result into MySQL
Can my current program structure be ported directly onto Heroku? What would I put in the Procfile? My first guess would be
scrape: python init_scrape.py
Where init_scrape.py first initializes the queue, then runs the workers. But I have no experience actually distributing a python task on the cloud, and I want to avoid costly mistakes.
Running this locally, I find that when I store the results directly into the database (which has 1.5 million rows, one per URL, each with an empty column where the cached source will go), each UPDATE query is slow (takes minutes). Is it better to store results in a temporary table and then merge the two tables afterward?
What technologies am I not using, that I should be? For example, I've seen Celery and Twisted both mentioned as suitable candidates for this kind of thing. I am not familiar with either but I've seen both as suggested alternatives in peripheral googling.
First off, if this "project" is short-lived, or generally won't be run in production, I suggest you don't start looking into "better technologies" until you really see that you need to. If you are only ever going to run this three times, it's a waste of time.
To your last question: Twisted is an async framework, much like Node.js, that allows a higher concurrency factor on a single machine. Celery is a distributed task queue and is very cool. Both are generally worth learning and would suit you fine (though I wouldn't bother with Twisted unless the scale was huge). Instead of Celery, for your simple case, you might consider "RedisQ", a Python module that does something similar on top of Redis (and has very concise documentation).
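To your MySQL question: that shouldn't be the case. A table with 1.5M rows is not small, but inserts and updates should definitely not take minutes. Begin your investigation by turning off any keys, indexes and primary keys you have. It is also worth running EXPLAIN on the UPDATE to see whether its WHERE clause uses an index or scans the whole table.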
To your Heroku architecture question: you would have 2 types of processes: a "web" process (which is your init_scrape.py), of which you will have 1 (heroku ps:scale web=1), and a "worker" process, of which you can have as many as you'd like; the number of workers is what increases your scale.
Your procfile will look something like:
web: python init_scrape.py
worker: python worker.py
Note that if you want to communicate with your init_scrape.py process, you must call it "web" in the Procfile. Note also that in that case you must bind a TCP listener (basically: spin up a simple HTTP server) to the port os.environ['PORT']. Only "web" processes get routed HTTP requests from "outside" of Heroku.
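As a rough illustration of binding to that port, here is a minimal sketch using only the Python 3 standard library. The names StatusHandler, make_server and serve_in_background are made up for this example, and the response text is arbitrary; the only thing taken from Heroku is that the port arrives in the PORT environment variable.

```python
import os
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer


class StatusHandler(BaseHTTPRequestHandler):
    """Answers every GET with a short plain-text status line."""

    def do_GET(self):
        body = b"scrape master is running\n"
        self.send_response(200)
        self.send_header("Content-Type", "text/plain")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, format, *args):
        # Silence per-request logging; remove this override to see access logs.
        pass


def make_server(port=None):
    """Build an HTTP server on the given port, or on $PORT if none is given."""
    if port is None:
        port = int(os.environ["PORT"])  # set by Heroku for web dynos
    return HTTPServer(("0.0.0.0", port), StatusHandler)


def serve_in_background(server):
    """Run the server on a daemon thread so the main thread stays free."""
    thread = threading.Thread(target=server.serve_forever, daemon=True)
    thread.start()
    return thread


if __name__ == "__main__":
    server = make_server()
    # Fill the Redis queue here (hypothetical step), then keep serving
    # so that the web dyno stays up.
    server.serve_forever()
```

If init_scrape.py needs to do its queue setup while serving, serve_in_background lets it start the listener first and carry on with the setup in the main thread.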
Also, note that your processes should never really "exit" (or Heroku will simply re-spin them). When they have nothing to do, they should simply wait/poll for tasks. You can then increase or decrease the number of workers by using heroku ps:scale.
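To make the "wait/poll instead of exiting" idea concrete, here is a minimal sketch of a worker loop in Python 3. The function names (run_worker, redis_popper) and their parameters are assumptions for illustration, not anything from your code; the Redis part assumes the redis-py package and that the queue is a plain Redis list.

```python
import time


def run_worker(pop_url, handle_url, idle_sleep=1.0, should_stop=lambda: False):
    """Process URLs until told to stop; when the queue is empty, wait and poll.

    pop_url:     callable returning the next URL, or None if nothing is queued
    handle_url:  callable that scrapes one URL and stores the result
    should_stop: hook for tests or graceful shutdown (never stops by default)
    """
    processed = 0
    while not should_stop():
        url = pop_url()
        if url is None:
            time.sleep(idle_sleep)  # nothing to do: idle instead of exiting
            continue
        try:
            handle_url(url)
            processed += 1
        except Exception as exc:
            # One bad URL should not take the whole dyno down.
            print("failed on %s: %s" % (url, exc))
    return processed


def redis_popper(redis_url, queue_name, timeout=5):
    """Return a pop function backed by a Redis list (needs redis-py installed)."""
    import redis  # imported lazily so run_worker has no hard dependency

    client = redis.Redis.from_url(redis_url)

    def pop():
        # BLPOP blocks for up to `timeout` seconds, then returns None.
        item = client.blpop(queue_name, timeout=timeout)
        return item[1].decode("utf-8") if item else None

    return pop
```

In worker.py you would then call something like run_worker(redis_popper(url, name), scrape_and_store), where scrape_and_store stands in for your existing requests + MySQL logic.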
The main issue with what you describe is that your master will not spin up the workers itself. The worker processes will run in entirely different dynos. The master will simply initialize the Redis queue (as you mention), maybe spin up a simple HTTP server so you can communicate with it, and then sit idly by.
The workers will need to be passed the redis queue name, and each worker will be in a dyno of its own.
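One way to pass the queue name to every dyno is through environment variables, which Heroku exposes to all processes of an app as config vars (heroku config:set). The sketch below assumes that approach; the variable names REDIS_URL and SCRAPE_QUEUE_NAME and the default values are illustrative, so substitute whatever your Redis setup actually provides.

```python
import os
from collections import namedtuple

WorkerConfig = namedtuple("WorkerConfig", ["redis_url", "queue_name"])


def load_worker_config(environ=os.environ):
    """Read the shared queue settings from environment variables.

    The variable names and fallbacks here are assumptions for this example;
    the fallbacks are only meant for running a worker on a local machine.
    """
    return WorkerConfig(
        redis_url=environ.get("REDIS_URL", "redis://localhost:6379/0"),
        queue_name=environ.get("SCRAPE_QUEUE_NAME", "scrape:urls"),
    )
```

Because both the master and the workers read the same variables, they agree on the queue without the master having to hand anything to the workers directly.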
Good luck!