Tags: python, sql, large-data

Python Strategy for Large Scale Analysis (on-the-fly or deferred)


What are the optimal strategies for analyzing a large number of websites or financial data sets and pulling out parametric data?

I'm classifying the following strategies as either "on-the-fly" or "deferred". Which is best?

  1. On-the-fly: Process data on-the-fly and store parametric data into a database
  2. Deferred: Store all the source data as ASCII files on a file system and post-process later, or with a processing-data-daemon
  3. Deferred: Store all pages as a BLOB in a database to post-process later, or with a processing-data-daemon

Number 1 is simplest, especially if you only have a single server. Can #2 or #3 be more efficient on a single server, or does their advantage only show with multiple servers?

Are there any python projects that are already geared toward this kind of analysis?

Edit: by "best" I mean fastest execution, to keep the user from waiting, with ease of programming as a secondary concern.


Solution

  • I'd use celery, either on a single machine or on multiple machines, with the "on-the-fly" strategy. You can have an aggregation task that fetches the data and a processing task that analyzes it and stores it in a db (a minimal sketch follows at the end of this answer). This is a highly scalable approach, and you can tune it according to your computing power.

    The "on-the-fly" strategy is more efficient in a sense that you process your data in a single pass. The other two involve an extra step, re-retrieve the data from where you saved them and process them after that.

    Of course, everything depends on the nature of your data and the way you process it. If the processing phase is slower than the aggregation, the "on-the-fly" strategy will hang and wait until the processing completes. But again, you can configure celery to run asynchronously and keep aggregating while there is still unprocessed data.
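
A minimal sketch of that two-task setup, assuming a Redis broker and using hypothetical parse_metrics() / save_metrics() helpers and a hypothetical urls_to_analyze list (none of which come from the answer above), could look like this:

```python
import requests
from celery import Celery, chain

# Broker URL is an assumption; any broker celery supports will do.
app = Celery("analysis", broker="redis://localhost:6379/0")

@app.task
def fetch(url):
    # Aggregation task: download the raw page text.
    return requests.get(url, timeout=30).text

@app.task
def process(raw_text):
    # Processing task: extract parametric data and store it in the db.
    metrics = parse_metrics(raw_text)   # hypothetical parser
    save_metrics(metrics)               # hypothetical database write
    return metrics

# Queue each pipeline asynchronously: fetching the next URL does not
# wait for earlier pages to finish processing.
for url in urls_to_analyze:             # hypothetical list of targets
    chain(fetch.s(url), process.s()).delay()
```

Because each chain is queued with .delay(), the aggregation and processing stages run independently of each other, which is what lets you keep fetching while a backlog of unprocessed pages sits in the queue.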