Search code examples
php.netazureamazon-ec2cloud-hosting

Cloud Architecture Stack Opinions - EC2 versus Azure


I have read many blog and articles about the pros and cons of Amazon EC2 versus Microsoft Azure (and Google's App Engine). However, I am trying to decide which would better suite my particular case.

I have a data set - which can be thought of as a standard table of the format:

[id]  [name]  [d0]  [d1]  [d2] .. [d63]
---------------------------------------
0     Name1   0.43 -0.22  0.11   -0.81
1     Name2   0.23  0.65  0.62    0.41
2     Name3  -0.13 -0.23  0.17    0.00
...
N     NameN   0.43 -0.23  0.12    0.01

I ultimately want to do something that (despite my final chosen stack) would equate to an SQL SELECT statement similar to:

SELECT name FROM [table] WHERE (d0*QueryParameter1) + (d1*QueryParameter1) +(d2*QueryParameter2) + ... + (dN*QueryParameterN) < 0.5

where QueryParameter1,2,N are parameters supplied at runtime, and change each time the query is run (so caching is out of the question).

My main concern is with the speed of the query, so I would like advice on which cloud stack option would provide the fastest query result possible.

I can do this a number of ways:

  • (1) Use SQL Azure, just as the query lies above. I have tried this method, and the queries can be quite slow as expected since SQL only gives you a single instance. I can spin up multiple instances of SQL and shard the data, but that gets real expensive real quick.
  • (2) Use Azure Storage Tables. Bloggers claim storage tables are faster in general, but would this still be the case for my query requirements?
  • (3) Use EC2 and spin up several instances with MySQL, possibly incorporating sharding to new instances (cost increases though).
  • (4) Use EC2 with MongoDB, as I've read it is faster than MySQL. Again this is probably dependent on the type of query.
  • (5) Google AppEngine. I'm not really sure how GAE would work with this query structure, but I guess that's why I am looking for opinions.

I'd like to find the best stack combination to optimize my specific need (outlined by the pseudo SQL query above).

Does anyone have any experience in this? Which stack option would result in the fastest query containing many math operators in the WHERE clause?

Cheers, Brett


Solution

  • Your type of query with dynamic coefficients (weights) will require the entire table to be scanned on every query. A SQL database engine is not going to help you here, because there is really nothing that the query optimizer can do.

    In other words, what you need is NOT a SQL database, but really a "NoSQL" database which really optimizes table/row access to the fastest speed possible. So you really shouldn't have to try SQL Azure and MySQL to find out this part of the answer.

    Also, each row in your type of query is completely independent from each other, so it lends itself to simple parallelism. Your choice of platform should be whichever gives you:

    1. Table/row scan at the fastest speed
    2. Ability to highly parallelize your operation

    Each platform you mentioned gives you ability to store huge amounts of blob or table-like data for very fast scan retrieval (e.g. table storage in Azure). Each also gives you the ability to "spin up" multiple instances to process them in parallel. It really depends on which programming environment you're most comfortable in (e.g. Java in Google/Amazon, .NET in Azure). In essence they all do the same thing.

    My personal recommendation is Azure, since you can:

    1. Store massive amounts of data in "table storage", optimized for fast scan retrieval, and partitioned (e.g. over d0 ranges) for optimal parallelism
    2. Dynamically "spin up" as many compute instances as you like to process the data in parallel
    3. Queueing mechanisms to synchronize the results collation

    Azure does what you requires in a very "no-frills" way -- providing just enough infrastructure for you to do your job, and nothing more.