sql-server · database-design · iot · database-performance

Database design for IoT application


Our application shows near-real-time IoT data (at intervals of up to 5 minutes) for our customers' remote equipment.

The original pilot project stores every device reading for all time, in a simple "Measurements" table on a SQL Server 2008 database. The table looks something like this:

Measurements: (DeviceId, Property, Value, DateTime).

Within a year or two, there will be maybe 100,000 records in the table per device, with the queries typically falling into two categories:

  • "Device latest value" (95% of queries): looking at the latest value only
  • "Device daily snapshot" (5% of queries): looking at a single representative value for each day

We are now expanding to 5000 devices. The Measurements table is small now, but will quickly get to half a billion records or so, for just those 5000 devices.

The application is very read-intensive, with frequently-run queries looking at the "Device latest values" in particular.

[EDIT #1: To make it less opinion-based]

What database design techniques can we use to optimise for fast reads of the "latest" IoT values, given a big table with years worth of "historic" IoT values?

One suggestion from our team was to store MeasurementLatest and MeasurementHistory as two separate tables.
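
Roughly, that split would look something like this (the column types shown here are just placeholders, since only the column names are fixed so far):

CREATE TABLE MeasurementLatest (
    DeviceId INT NOT NULL,
    Property VARCHAR(50) NOT NULL,
    Value FLOAT NOT NULL,
    [DateTime] DATETIME2 NOT NULL,
    CONSTRAINT PK_MeasurementLatest PRIMARY KEY (DeviceId, Property)
);

CREATE TABLE MeasurementHistory (
    DeviceId INT NOT NULL,
    Property VARCHAR(50) NOT NULL,
    Value FLOAT NOT NULL,
    [DateTime] DATETIME2 NOT NULL
);

-- Each new reading would be appended to MeasurementHistory, and the matching
-- (DeviceId, Property) row in MeasurementLatest updated (or inserted).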

[EDIT #2: In response to feedback]

In our test database, seeded with 50 million records, and with the following index applied:

CREATE NONCLUSTERED INDEX [IX_Measurement_DeviceId_DateTime] ON Measurements (DeviceId ASC, DateTime DESC)

a typical "get device latest values" query (e.g. below) still takes more than 4,000 ms to execute, which is way too slow for our needs:

SELECT DeviceId, Property, Value, DateTime
FROM Measurements m
WHERE m.DateTime = (
  SELECT MAX(DateTime) 
  FROM Measurements m2
  WHERE m2.DeviceId = m.DeviceId)

Solution

  • This is a very broad question - and as such, it's unlikely you'll get a definitive answer.

    However, I have been in a similar situation, so I'll run through my thinking and eventual approach. In summary: I kept everything in one table, but used a filtered index to 'mimic' the separate, smaller "latest values" table.

    My original thinking was to have two tables - one with the 'latest' data only, used for most reporting, and a second with all the historical values. An alternative was to keep one table with all records, plus a second table holding just the latest row per device.

    With either of those layouts, inserting a new reading would typically mean writing to at least two rows, if not more (depending on how it's stored).

    Instead, I went for a slightly different route:

    • Put all the data into one table
    • On that one table, add a new column 'Latest_Flag' (bit, NOT NULL, DEFAULT 1). If it's 1 then it's the latest value; otherwise it's historical
    • Have a filtered index on the table that has all columns (with appropriate column order) and filter of Latest_Flag = 1
      • This filtered index is similar to a second copy of the table with just the latest rows only
    • The insert process therefore has two steps, wrapped in a transaction (see the sketch after this list)
      • 'Unflag' the previous latest row for that device/property (set its Latest_Flag to 0)
      • Insert the new row (Latest_Flag defaults to 1)
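
    As a rough sketch (the column data types and the @-parameters are assumptions - adapt them to your actual schema and calling code), the flag and the two-step insert might look like:

    ALTER TABLE dbo.Measurements
        ADD Latest_Flag BIT NOT NULL DEFAULT 1;

    -- New reading arrives (e.g. inside a stored procedure taking these parameters):
    BEGIN TRANSACTION;

        -- Step 1: unflag the previous 'latest' row for this device/property
        UPDATE dbo.Measurements
        SET Latest_Flag = 0
        WHERE DeviceId = @DeviceId
          AND Property = @Property
          AND Latest_Flag = 1;

        -- Step 2: insert the new row as the latest
        INSERT INTO dbo.Measurements (DeviceId, Property, Value, [DateTime], Latest_Flag)
        VALUES (@DeviceId, @Property, @Value, @DateTime, 1);

    COMMIT TRANSACTION;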

    It still makes the writes a bit slower (each insert now also performs a row update, plus the associated index maintenance), but fundamentally it does the pre-calculation at write time so the far more frequent reads stay cheap.

    When reading from the table, however, you then need to specify WHERE Latest_Flag = 1 everywhere. Alternatively, you may want to wrap that filter in a view or similar.
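
    As a sketch (the view name here is made up), the view keeps that filter out of every query:

    CREATE VIEW dbo.MeasurementsLatest
    AS
    SELECT DeviceId, Property, Value, [DateTime]
    FROM dbo.Measurements
    WHERE Latest_Flag = 1;
    GO

    -- "Latest values for a device" then becomes a simple lookup:
    SELECT DeviceId, Property, Value, [DateTime]
    FROM dbo.MeasurementsLatest
    WHERE DeviceId = @DeviceId;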

    For the filtered index, it may be something like

    CREATE INDEX ix_measurements_deviceproperty_latest 
        ON Measurements (DeviceId, Property)
        INCLUDE (Value, DateTime, Latest_Flag)
        WHERE (Latest_Flag = 1)
    

    Note - another version of this can be done with a trigger: when a new row is inserted, the trigger invalidates (sets Latest_Flag = 0) any previous rows for that device/property. That removes the need for the two-step insert, but it does mean relying on business/processing logic living inside triggers.
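
    As a sketch of that approach (again assuming the column names above), the trigger could look something like:

    CREATE TRIGGER trg_Measurements_SetLatest
        ON dbo.Measurements
        AFTER INSERT
    AS
    BEGIN
        SET NOCOUNT ON;

        -- Unflag any older rows for the device/property combinations just inserted
        UPDATE m
        SET m.Latest_Flag = 0
        FROM dbo.Measurements m
        INNER JOIN inserted i
            ON  m.DeviceId = i.DeviceId
            AND m.Property = i.Property
        WHERE m.Latest_Flag = 1
          AND m.[DateTime] < i.[DateTime];
    END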