Search code examples
mysqlanalyticsolapstar-schema

star schema for implementing near real time analytics with changing data


I implementing analytics for a medical software. The data to be processed is mainly appointment related. I'm planing to implement star schema for generating reports. I have a few doubts

  1. My data can change like a appointment can be marked as cancelled later, i read that changing data in star schema is not a good idea. If not what is a better approach.
  2. The data to my fact tables will inserted by a background task when the data is added to my main database. Is constant insertion of data to fact table a issue as reposts are viwed in the application almost anytime.
  3. I am planning to implement it in mysql, and if someone can point me to some post releated to performance of mysql with this kind of structure it would be great. Also which is a better engine to implement this schema Innodb or Myisam

Thanks.


Solution

  • I'll try to answer in general terms that are not tied to a specific database technology (I'm a MS SQL Server DWH person).

    To address your specific questions ...

    "1.My data can change like a appointment can be marked as cancelled later, i read that changing data in star schema is not a good idea. If not what is a better approach."

    There are two main table types in DWHes Fact tables and Dimension tables.

    Changing fact or dimensional data in a star schema is perfectly valid. It is not considered good practise to delete dimension records from a DWH.

    You need to make a choice of type 1 (overwite history) or type 2 (keep history) changes to data (Slowly Changing Dimension).

    I'm not sure if you are suggesting deleting fact records here, but a better approach would be to have a flag on each fact record to indicate the status of the appointment (booked/used/cancelled/etc) and if a patient cancels their appointment then change the fact record from status=booked to status=cancelled; not actually deleting the fact record. This way also you can track the number of cancelled appointments.

    To add a complication you could have your fact table keeping history as well, so that you can show an "as at" position, i.e. be able to show the state of the database as at a particular point in time. This increases the size of your database quite a bit, depending on the number of changes that occurr.

    "2.The data to my fact tables will inserted by a background task when the data is added to my main database. Is constant insertion of data to fact table a issue as reposts are viwed in the application almost anytime."

    You need to have a discussion around the frequency of updates / the importance of having up to date data. Generally (as I'm sure you are aare) DWHes are not OLTP systems and so aren't meant to be constantly being updated with new data and able to be reported on the most up to date data. If you want that really you want an OLTP system.

    That said, we have implemented a DWH that did updates every 20 minutes. This had a SQL DWH with an OLAP cube sitting on top. Im not sure if mysql has OLAP technology, but I'd feel sure there is some opensource version available. There are several flavours of OLAP (MOLAP/ROLAP/HOLAP) which give different focus to performance/data currency.

    You would normally want to separate the DWH itself from the reporting DB layer, especially if there are many users.

    "3.I am planning to implement it in mysql, and if someone can point me to some post releated to performance of mysql with this kind of structure it would be great. Also which is a better engine to implement this schema Innodb or Myisam"

    I'll have to pass on this question. I used to know a bit about innoDB adn MyISAM, but its been about 8 years since i played with the technology.

    A very good book on Star Schema DWH design is by Ralph Kimball on DWH Design Book