I have read many blog and articles about the pros and cons of Amazon EC2 versus Microsoft Azure (and Google's App Engine). However, I am trying to decide which would better suite my particular case.
I have a data set - which can be thought of as a standard table of the format:
[id] [name] [d0] [d1] [d2] .. [d63]
---------------------------------------
0 Name1 0.43 -0.22 0.11 -0.81
1 Name2 0.23 0.65 0.62 0.41
2 Name3 -0.13 -0.23 0.17 0.00
...
N NameN 0.43 -0.23 0.12 0.01
I ultimately want to do something that (despite my final chosen stack) would equate to an SQL SELECT
statement similar to:
SELECT name FROM [table] WHERE (d0*QueryParameter1) + (d1*QueryParameter1) +(d2*QueryParameter2) + ... + (dN*QueryParameterN) < 0.5
where QueryParameter1,2,N
are parameters supplied at runtime, and change each time the query is run (so caching is out of the question).
My main concern is with the speed of the query, so I would like advice on which cloud stack option would provide the fastest query result possible.
I can do this a number of ways:
MySQL
. Again this is probably dependent on the type of query.I'd like to find the best stack combination to optimize my specific need (outlined by the pseudo SQL
query above).
Does anyone have any experience in this? Which stack option would result in the fastest query containing many math operators in the WHERE
clause?
Cheers, Brett
Your type of query with dynamic coefficients (weights) will require the entire table to be scanned on every query. A SQL database engine is not going to help you here, because there is really nothing that the query optimizer can do.
In other words, what you need is NOT a SQL database, but really a "NoSQL" database which really optimizes table/row access to the fastest speed possible. So you really shouldn't have to try SQL Azure and MySQL to find out this part of the answer.
Also, each row in your type of query is completely independent from each other, so it lends itself to simple parallelism. Your choice of platform should be whichever gives you:
Each platform you mentioned gives you ability to store huge amounts of blob or table-like data for very fast scan retrieval (e.g. table storage in Azure). Each also gives you the ability to "spin up" multiple instances to process them in parallel. It really depends on which programming environment you're most comfortable in (e.g. Java in Google/Amazon, .NET in Azure). In essence they all do the same thing.
My personal recommendation is Azure, since you can:
Azure does what you requires in a very "no-frills" way -- providing just enough infrastructure for you to do your job, and nothing more.