database scalability analytics data-mining

Architecture for database analytics


We have an architecture where we provide each customer (an internet merchant) with Business Intelligence-like services for their website. Now I need to analyze that data internally (for algorithmic improvement, performance tracking, etc.), and those analyses are potentially quite heavy: we have up to millions of rows per customer per day, and I may want to know how many queries we had in the last month, compared week by week, and so on. That is on the order of billions of entries, if not more.

The way it is currently done is quite standard: daily scripts scan the databases and generate big CSV files. I don't like this solution for several reasons:

  • as is typical with those kinds of scripts, they fall into the write-once, never-touched-again category
  • tracking things in "real time" is necessary (at the moment we have a separate toolset to query the last few hours)
  • it is slow and not "agile"

Although I have some experience in dealing with huge datasets for scientific usage, I am a complete beginner as far as traditional RDBMSs go. It seems that using a column-oriented database for analytics could be a solution (the analytics don't need most of the data we have in the app database), but I would like to know what other options are available for this kind of issue.


Solution

  • You will want to google "star schema". The basic idea is to model a special data warehouse / OLAP instance of your existing OLTP system in a way that is optimized to provide the type of aggregations you describe. This instance will be composed of facts and dimensions.

    In the example below, sales 'facts' are modeled to provide analytics based on customer, store, product, time and other 'dimensions'.

    [Image: star schema diagram]

    You will find Microsoft's Adventure Works sample databases instructive, in that they provide both the OLTP and OLAP schemas along with representative data.
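
To make the facts-and-dimensions idea concrete, here is a minimal sketch using Python's built-in sqlite3 module. All table and column names (fact_sales, dim_customer, dim_store, dim_product, dim_date, amount_cents, iso_week) and the sample rows are hypothetical and chosen only for illustration; a real warehouse would likely run on a dedicated analytics database and be loaded by an ETL job rather than by hand.

```python
import sqlite3


def build_star_schema(conn):
    """Create one fact table surrounded by four dimension tables (names are hypothetical)."""
    conn.executescript("""
        CREATE TABLE dim_customer (customer_id INTEGER PRIMARY KEY, name TEXT);
        CREATE TABLE dim_store    (store_id    INTEGER PRIMARY KEY, city TEXT);
        CREATE TABLE dim_product  (product_id  INTEGER PRIMARY KEY, name TEXT);
        CREATE TABLE dim_date     (date_id     INTEGER PRIMARY KEY, day TEXT, iso_week TEXT);

        -- Each fact row holds foreign keys into the dimensions plus numeric measures.
        CREATE TABLE fact_sales (
            customer_id  INTEGER REFERENCES dim_customer(customer_id),
            store_id     INTEGER REFERENCES dim_store(store_id),
            product_id   INTEGER REFERENCES dim_product(product_id),
            date_id      INTEGER REFERENCES dim_date(date_id),
            quantity     INTEGER,
            amount_cents INTEGER
        );
    """)


def load_sample_rows(conn):
    """Insert a handful of made-up rows so the query below has something to aggregate."""
    conn.executemany("INSERT INTO dim_customer VALUES (?, ?)",
                     [(1, "acme"), (2, "globex")])
    conn.executemany("INSERT INTO dim_store VALUES (?, ?)", [(1, "online")])
    conn.executemany("INSERT INTO dim_product VALUES (?, ?)",
                     [(1, "widget"), (2, "gadget")])
    conn.executemany("INSERT INTO dim_date VALUES (?, ?, ?)",
                     [(1, "2024-01-01", "2024-W01"),
                      (2, "2024-01-02", "2024-W01"),
                      (3, "2024-01-08", "2024-W02")])
    conn.executemany("INSERT INTO fact_sales VALUES (?, ?, ?, ?, ?, ?)",
                     [(1, 1, 1, 1, 2, 1000),
                      (1, 1, 2, 2, 1, 500),
                      (1, 1, 1, 3, 1, 500),
                      (2, 1, 2, 3, 3, 1500)])


def weekly_totals(conn):
    """Aggregate the sales measure per customer and ISO week by joining fact to dimensions."""
    return conn.execute("""
        SELECT c.name, d.iso_week, SUM(f.amount_cents)
        FROM fact_sales AS f
        JOIN dim_customer AS c ON c.customer_id = f.customer_id
        JOIN dim_date     AS d ON d.date_id     = f.date_id
        GROUP BY c.name, d.iso_week
        ORDER BY c.name, d.iso_week
    """).fetchall()


if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    build_star_schema(conn)
    load_sample_rows(conn)
    for row in weekly_totals(conn):
        print(row)
```

The point of the layout is that a week-over-week comparison becomes a single join and GROUP BY on the fact table, instead of a custom script that scans the application database. Adding another way to slice the data (by store or product here) only means joining one more dimension.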