I'm about to undertake a large project, where I'll need scheduled tasks (cron jobs) to run a script that will loop through my entire database of entities and make calls to multiple API's such as Facebook, Twitter & Foursquare every 10 minutes. I need this application to be scalable.
I can already foresee a few potential pitfalls...
My question is how to structure my application...?
Thanks!
I'm about to undertake a large project, where I'll need scheduled tasks (cron jobs) to run a script that will loop through my entire database of entities and make calls to multiple API's such as Facebook, Twitter & Foursquare every 10 minutes. I need this application to be scalable.
Your best option is to design the application to make use of a distributed database, and deploy it on multiple servers.
You can design it to work in two "ranks" of servers, not unlike the map-reduce approach: lightweight servers that only perform queries and "pre-digest" some data ("map"), and servers that aggregate the data ("reduce").
Once you do that, you can establish a performance baseline and calculate that, say, if you can generate 2000 queries per minute and you can handle as many responses, then you need a new server every 20,000 users. In that "generate 2000 queries per minute" you need to factor in:
An advantage of this architecture is that you can start small - a testbed can be built with a single machine running both Connector, Mapper, Reducer, Command and Control and Persistence. When you grow, you just outsource different services to different servers.
On several distributed computing platforms, this also allows you to run queries faster by judiciously allocating Mappers geographically or connectivity-wise, and reduce the traffic costs between your various platforms by playing with, e.g. Amazon "zones" (Amazon has also a message service that you might find valuable for communicating between the tasks)
One note: I'm not sure that PHP is the right tool for this whole thing. I'd rather think Python.
At the 20,000 users-per-instance traffic level, though, I think that you'd better take this up with the guys at Facebook, Foursquare etc. . At a minimum you might glean some strategies such as running the connector scripts as independent tasks, each connector sorting its queue based on that service's user IDs, to leverage what little data locality there might be, and taking advantage of pipelining to squeeze more bandwidth with less server load. At the most, they might point you to bulk APIs or different protocols, or buy you for one trillion bucks :-)