Search code examples
mysqldata-warehousebusiness-intelligencequery-performance

Optimal way of storing performance data for statistics (graphs)


Currently I'm working on a dashboard in PHP/MySQL which contains several statistics/facts such as: amount of items sold, revenue, gender (male/female) ratio of users etc. (all filterable on last week/month/year). The amount of data is (currently) not that much: 20.000 user rows, 1.000 items, 500 items sold per day but is expected to grow in the future, perhaps even exponentially.

Now, there is a wish to have several graphs displaying the performance to see whether strategy changes have impacts on the amount of users, revenue, gender ratio etc. For this, it is necessary to have numbers per day. Currently, the dashboard can only display "NOW() - 1 week/1 month/1 year" but for showing a graph outlining the growth, these numbers should be saved on a daily basis.

My question is: what are the options in this case? A cronjob could be set in place to save these numbers and write them to a separate 'performance' or 'history' table that saves the visitors, sales, gender ratio etc. in rows linked to the date of that day. This is good for performance, but certain data gets lost. Another option is to compute these numbers with complex queries (group by day) etc, but that seems to intensive since the queries are performed on the production database. Especially since the database structure is a little complex. Thinking of avoiding doing this on the production database, is setting up a data-warehouse with ETL-processes a better option to avoid overloading the production database? In that case the data would not be displayed live.

I honestly have no idea what is the best option in this case. I'm very curious about the answers! Many thanks.


Solution

  • Running query on a production database (especially one which is growing in volume and complexity) become a losing proposition very quickly. There are a lot of possible alternative, basically the entire field of Business Intelligence is grown as as solution of this problem.

    For a small system where you just want to avoid to query the production database probably the development of a full blown Data Warehouse is overkill. It is impossible to give a reasonable answer without knowing more, but I would go for one of the following (in growing order of complexity/degree of result):

    1. Instead of directly show the result of the query, save it in a table and query the table
    2. Clone your production database then query the clone
    3. Extract relevant data from production database in a structure which save relevant data and preserve history (google Data Vault)
    4. Direct over the production DB, or over solution 2 or 3 build a dimensional model (google Kimball Dimensional Model). Pay attention that to do a good job you have to consider what kind of queries you want to do. You could end up with different designs for different requirement.

    It is also relevant which technology are you using and what are the options available on your available architecture. Depending on what you have on hand, you could have some solution, even complex ones, very much simplified. Do some research.