Tags: database, optimization, database-design, time-series, data-warehouse

Time series data in a relational database?


I have time series data in a relational database (Postgres). Data is imported into the database every 5 minutes, but each import overwrites the previous one during the day, so at the end of the day there is only one record per ID for that day (id and date form a composite primary key).
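Roughly what that looks like, as a sketch with placeholder table and column names (the real schema will differ):

    -- One row per id per day; each 5-minute import overwrites it
    CREATE TABLE measurements (
        id    integer NOT NULL,
        day   date    NOT NULL,
        value numeric,
        PRIMARY KEY (id, day)      -- composite PK: id + date
    );

    -- A 5-minute load replaces whatever is already there for that id/day
    INSERT INTO measurements (id, day, value)
    VALUES (42, CURRENT_DATE, 123.4)
    ON CONFLICT (id, day)
    DO UPDATE SET value = EXCLUDED.value;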

The current process: data comes in and is processed 1:1 (every table is loaded exactly as it is in the source, so there is a lot of redundancy).

3 problems:

  • currently, reading data out of the database is fast (good performance)

The frontend queries this database and displays the data, and those queries return very quickly. If I normalise the schema, queries become slower, but writing and updating become easier. How can I optimise this database?

  • missing data (ignore this problem)

If we were able to store more records daily (the history of one ID at different points in time throughout the day), we could show a comparison of two points in time within a day. Can the database support that amount of data every day? (See the sketch after this list.)

  • DWH

There is only one source; all data comes from it. Can we have a DWH for it, or is there no need since there is only one source?
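On the second point, keeping intraday history is mostly a matter of widening the key so that loads append instead of overwrite. A sketch using the same placeholder names as above:

    -- One row per id per 5-minute load, instead of one per id per day
    CREATE TABLE measurements_history (
        id          integer     NOT NULL,
        captured_at timestamptz NOT NULL,   -- load timestamp, not just the date
        value       numeric,
        PRIMARY KEY (id, captured_at)
    );

    -- Compare two points in time within one day for a given id
    SELECT a.value AS value_at_08, b.value AS value_at_20
    FROM   measurements_history a
    JOIN   measurements_history b USING (id)
    WHERE  a.id = 42
    AND    a.captured_at = '2024-01-15 08:00+00'
    AND    b.captured_at = '2024-01-15 20:00+00';

At one row every 5 minutes that is 288 rows per ID per day, which Postgres handles comfortably; partitioning the table (for example by month) is the usual way to keep it manageable as history grows.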

Edit:

How can I optimise this database?

Currently there is only one schema in the database. Data comes in and is processed 1:1. Writing is hard because of the redundancy.

My solution:

I want to create 3 schemas for this database.

Schema 1: for inserting data into tables whose structure is based on the data source. (I assume data stays here only temporarily and will be transferred to the second schema.)

Schema 2: the incoming data is stored here, structured in 3NF.

Schema 3: the data is denormalised again, because fast queries (fast reads) are required.
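A minimal sketch of that layout in Postgres; the schema names (staging, core, reporting) and the example table are placeholders, not anything prescribed:

    -- One schema per layer (names are placeholders)
    CREATE SCHEMA staging;    -- 1: temporary landing area, structured like the source
    CREATE SCHEMA core;       -- 2: permanent, normalised (3NF) storage
    CREATE SCHEMA reporting;  -- 3: denormalised, read-optimised objects

    -- Example core table the other layers can build on
    CREATE TABLE core.measurements (
        id    integer NOT NULL,
        day   date    NOT NULL,
        value numeric,
        PRIMARY KEY (id, day)
    );

    -- The reporting layer can start out as materialized views over core
    CREATE MATERIALIZED VIEW reporting.daily_values AS
    SELECT id, day, value
    FROM   core.measurements;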


Solution

  • Your three-schema model is exactly how this has been done for many years.

    Schema 1:

    Names: Staging/Landing/Ingestion

    The schema matches the source system, but it is cleared and reloaded for every load batch. It typically has a "looser" schema definition (for example, nullable text columns and few constraints) so that bad data can still be imported and captured.
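    A sketch of what such a "loose" staging table could look like, reusing the placeholder staging schema from the question (all names and the file path are illustrative):

        -- Everything nullable text, no constraints, so a bad row never breaks the load
        CREATE TABLE staging.source_feed (
            id_raw       text,
            captured_raw text,
            value_raw    text,
            loaded_at    timestamptz DEFAULT now()
        );

        -- Cleared and reloaded on every batch
        TRUNCATE staging.source_feed;
        COPY staging.source_feed (id_raw, captured_raw, value_raw)
        FROM '/path/to/latest_extract.csv'   -- placeholder path
        WITH (FORMAT csv, HEADER true);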

    Schema 2:

    Names: Replica/ODS/Persisted data store

    Schema 2 is never cleared; it is permanent. After a data load, this layer should look exactly like your source systems. Data in schema 1 is "merged" into schema 2 on each load. For example, on a daily load cycle, schema 1 contains only that day's data, while schema 2 contains the entire history of everything loaded so far. Reference data is merged on a known primary key. Transactional data might be merged on a key, or it might be merged on a "windowing" basis, i.e. delete the last day's data from schema 2 and load schema 1 in its place.
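    A sketch of both merge styles, continuing the placeholder staging.source_feed and core.measurements tables from the earlier sketches (column names and casts are illustrative):

        -- Key-based merge (reference data): upsert on the known primary key
        INSERT INTO core.measurements (id, day, value)
        SELECT id_raw::integer, captured_raw::date, value_raw::numeric
        FROM   staging.source_feed
        ON CONFLICT (id, day)
        DO UPDATE SET value = EXCLUDED.value;

        -- Window-based merge (transactional data): drop the last day's slice and reload it
        DELETE FROM core.measurements
        WHERE  day = CURRENT_DATE;

        INSERT INTO core.measurements (id, day, value)
        SELECT id_raw::integer, captured_raw::date, value_raw::numeric
        FROM   staging.source_feed;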

    Some people like to have a "point in time" view where they can recreate what the source system looked like at a historical point in time. I've never seen anyone use that, though.

    Schema 3:

    Names: Business Layer/Star Schema/Reporting Layer/Datamart/Semantic Layer

    Layer 2 is usually a replica of an OLTP data model (OLTP is optimised for entering data). In layer 3 this is transformed into a data model that is optimised for reporting.

    The tried and tested data model here is a star schema; it has been around for decades. If you research any reporting tool (e.g. Power BI), they all say that the preferred data model to report from is a star schema. Yes, a star schema is denormalised, and it has benefits beyond performance: for example, it is more easily understood by business users and it supports slowly changing dimensions.
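    A very small star schema for the data in the question might look like this (all names are illustrative): one fact table holding the measures, joined to dimension tables holding descriptive attributes.

        -- Dimension tables: one row per member, descriptive attributes
        CREATE TABLE reporting.dim_date (
            date_key date PRIMARY KEY,
            year     int,
            month    int,
            day      int
        );

        CREATE TABLE reporting.dim_item (
            item_key  integer PRIMARY KEY,
            item_name text
        );

        -- Fact table: numeric measures plus foreign keys to the dimensions
        CREATE TABLE reporting.fact_measurements (
            date_key date    NOT NULL REFERENCES reporting.dim_date,
            item_key integer NOT NULL REFERENCES reporting.dim_item,
            value    numeric,
            PRIMARY KEY (date_key, item_key)
        );

    Reporting queries then join the fact table to whichever dimensions they need, which is the shape tools such as Power BI expect.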

    All these concepts are explained further online, but if you have any specific questions I'm happy to expand further.