We use Postgres for analytics (star schema). Every few seconds we get reports on ~500 metrics types. The simplest schema would be:
timestamp metric_type value
78930890 FOO 80.9
78930890 ZOO 20
Our DBA has came up with a suggestion to flatten all reports of the same 5 seconds to:
timestamp metric1 metric2 ... metric500
78930890 90.9 20 ...
Some developers push back on this saying this adds a huge complexity on development (batching data so it is written in one shot) and to maintainability (just looking at the table or adding fields is more complex).
Is the DBA model the standard practice in such systems or only a last resort once the original model is clearly not scalable enough?
EDIT: the end goal is to draw a line chart for the users. So queries will mostly be selecting a few metrics, folding them by hours, and selecting min/max/avg per hour (or any otehr time period).
EDIT: the DBA arguments are:
This is relevant from day 1 (see below) but even if was not this is something the system eventually will need to do and migrating from another schema will be a pain
Reducing the number of rows x500 times will allow more efficient indexes and memory (the table will contain hundreds of millions of rows before this optimization)
When selecting multiple metrics the suggested schema will allow one pass over the data instead of separate query for each metric (or some complex combinations of OR and GroupBY)
EDIT: 500 metrics is an "upper bound" but in practice most of the time only ~40 metrics are reported per 5 seconds (not the same 40 though)
The DBA's suggestion isn't totally unreasonable if the metrics are fairly fixed, and make sense to group together. A couple of problems you'll likely face, though:
Instead, you might want to consider using an HSTORE column:
CREATE TABLE metrics (
timestamp INTEGER,
values HSTORE
)
This will give you some flexibility in storing attributes, and allows for indices. For example, to index just one of the metrics:
CREATE INDEX metrics_metric3 ON metrics ((values->'metric3'))
One drawback of this is that values can only be text strings… so if you need to do numeric comparisons, a JSON column might also be worth considering:
CREATE TABLE metrics (
timestamp INTEGER,
values JSON
)
CREATE INDEX metrics_metric3 ON metrics ((values->'metric3'))
The drawback here is that you'll need to use Postgres 9.3, which is still reasonably new.