Can I use cron jobs for my application (needs to be extremely scalable)?

I'm about to undertake a large project where I'll need scheduled tasks (cron jobs) to run a script that loops through my entire database of entities and calls multiple APIs, such as Facebook, Twitter and Foursquare, every 10 minutes. I need this application to be scalable.

I can already foresee a few potential pitfalls...

Fetching data from APIs is slow.
With thousands of records in my database (and the number constantly increasing), it will take too long to process every record within 10 minutes.
Some shared servers stop any script that runs longer than 30 seconds.
Constantly running intensive scripts may cause server issues.

My question is: how should I structure my application?

Could I create multiple cron jobs to handle small segments of my database (this would have to be automated)?
This could require thousands of cron jobs. Is that sustainable?
How can I get around the 30-second limit on some servers?
Is there a better way to go about this?

Thanks!

Solution

Your best option is to design the application to make use of a distributed database, and deploy it on multiple servers.

You can design it to work in two "ranks" of servers, not unlike the map-reduce approach: lightweight servers that only perform queries and "pre-digest" some data ("map"), and servers that aggregate the data ("reduce").
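As a rough illustration of the two ranks, here is a minimal single-process sketch in Python. Everything in it is an assumption made for the example: `fetch_counts` is a hypothetical stand-in that makes no network call, the "mentions" field is invented, and a thread pool plays the role of the lightweight mapper servers.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor

SERVICES = ("facebook", "twitter", "foursquare")


def fetch_counts(entity_id, service):
    """Hypothetical stand-in for a real API call; no network request is made."""
    return {"entity": entity_id, "service": service, "mentions": entity_id % 5}


def map_entity(entity_id):
    """Mapper: query every service for one entity and pre-digest the replies."""
    replies = (fetch_counts(entity_id, service) for service in SERVICES)
    return [(entity_id, reply["mentions"]) for reply in replies]


def reduce_results(mapped):
    """Reducer: aggregate per-entity totals from all mapper outputs."""
    totals = Counter()
    for pairs in mapped:
        for entity_id, mentions in pairs:
            totals[entity_id] += mentions
    return dict(totals)


def run_cycle(entity_ids, workers=4):
    """Run one cycle: fan out to mappers, then reduce the results."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        mapped = list(pool.map(map_entity, entity_ids))
    return reduce_results(mapped)
```

In a real deployment the mappers would be separate machines and the reducer would read from a shared store or message queue, but the division of labour stays the same.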

Once you do that, you can establish a performance baseline and do the capacity math. Say one server can generate 2000 queries per minute and handle as many responses: over a 10-minute cycle that is 20,000 queries, so at one query per user per cycle you need a new server every 20,000 users. In that "generate 2000 queries per minute" figure you need to factor in:

data retrieval from the database
traffic bandwidth from and to the control servers
traffic bandwidth to Facebook, Foursquare, Twitter etc.
necessity to log locally (and maybe distill and upload log digests to Command and Control)
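The arithmetic can be sketched in a few lines of Python. This is only a back-of-the-envelope model: it assumes each factor above can be expressed as its own queries-per-minute ceiling and that the slowest one bounds the server; the function and parameter names are made up for the example.

```python
import math


def effective_queries_per_minute(**ceilings):
    """The slowest factor (database, bandwidth, logging...) bounds throughput."""
    return min(ceilings.values())


def users_per_server(queries_per_minute, cycle_minutes=10, queries_per_user=1):
    """How many users one server can cover within a single cycle."""
    return (queries_per_minute * cycle_minutes) // queries_per_user


def servers_needed(total_users, queries_per_minute,
                   cycle_minutes=10, queries_per_user=1):
    """Number of servers required, rounded up."""
    capacity = users_per_server(queries_per_minute, cycle_minutes, queries_per_user)
    return math.ceil(total_users / capacity)


# Hypothetical measured ceilings, in queries per minute.
qpm = effective_queries_per_minute(
    database=5000, control_link=3500, api_link=2000, logging=8000
)
print(users_per_server(qpm))       # capacity of one server per 10-minute cycle
print(servers_needed(50000, qpm))  # servers for 50,000 users
```

Note that querying three services per user (`queries_per_user=3`) cuts the per-server capacity to roughly a third, so the 20,000 figure is an upper bound under the one-query-per-user assumption.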

An advantage of this architecture is that you can start small: a testbed can be built with a single machine running the Connector, Mapper, Reducer, Command and Control, and Persistence roles together. When you grow, you just move the different services out to different servers.
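One way to keep that migration path open is to have the roles talk only through queues, even on the single-machine testbed. The sketch below is an assumption about how that might look, not a prescribed design: it runs one mapper and one reducer as threads in one process, with `queue.Queue` standing in for whatever network message service you would use later, and a placeholder computation instead of a real API call.

```python
import queue
import threading

STOP = object()  # sentinel that tells a stage to shut down


def mapper(inbox, outbox):
    """Mapper role: take entity IDs and emit pre-digested records."""
    while True:
        entity_id = inbox.get()
        if entity_id is STOP:
            outbox.put(STOP)  # pass the shutdown signal downstream
            break
        outbox.put((entity_id, entity_id * 2))  # placeholder for a real API call


def reducer(inbox, results):
    """Reducer role: aggregate whatever the mapper produces."""
    while True:
        item = inbox.get()
        if item is STOP:
            break
        entity_id, value = item
        results[entity_id] = value


def run_testbed(entity_ids):
    """Wire both roles together inside one process and run one cycle."""
    work, digests, results = queue.Queue(), queue.Queue(), {}
    threads = [
        threading.Thread(target=mapper, args=(work, digests)),
        threading.Thread(target=reducer, args=(digests, results)),
    ]
    for thread in threads:
        thread.start()
    for entity_id in entity_ids:
        work.put(entity_id)
    work.put(STOP)
    for thread in threads:
        thread.join()
    return results
```

Because neither role knows where its queue lives, swapping the in-process queues for a networked one should not require touching the mapper or reducer logic.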

On several distributed computing platforms, this also allows you to run queries faster by judiciously allocating Mappers geographically or connectivity-wise, and to reduce the traffic costs between your various platforms by playing with, e.g., Amazon "zones". (Amazon also has a message service that you might find valuable for communicating between the tasks.)

One note: I'm not sure that PHP is the right tool for this whole thing. I'd lean towards Python.

At the 20,000 users-per-instance traffic level, though, I think you'd better take this up with the guys at Facebook, Foursquare, etc. At a minimum, you might glean some strategies, such as running the connector scripts as independent tasks, having each connector sort its queue by that service's user IDs to leverage what little data locality there might be, and taking advantage of pipelining to squeeze out more bandwidth with less server load. At most, they might point you to bulk APIs or different protocols, or buy you for one trillion bucks :-)
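For what it's worth, the per-connector queue idea could be sketched like this in Python. The input format (pairs of service name and that service's user ID) and the batch size are assumptions for the example; whether a given service actually accepts batched or pipelined requests is something to confirm with that service.

```python
from collections import defaultdict


def build_connector_queues(pending, batch_size=50):
    """Split pending lookups into one sorted, batched queue per service.

    pending: iterable of (service, service_user_id) pairs.
    Returns {service: [[id, id, ...], ...]} with IDs in ascending order.
    """
    by_service = defaultdict(list)
    for service, user_id in pending:
        by_service[service].append(user_id)

    queues = {}
    for service, ids in by_service.items():
        ids.sort()  # neighbouring IDs end up in the same batch
        queues[service] = [
            ids[i:i + batch_size] for i in range(0, len(ids), batch_size)
        ]
    return queues


pending = [("twitter", 30), ("facebook", 7), ("twitter", 10), ("twitter", 20)]
print(build_connector_queues(pending, batch_size=2))
```

Each connector would then run as its own task and work through only its own service's batches.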