So, a quick update on why I created this question.

We are currently storing the telemetry data from our devices in the field in Azure SQL Server. This is working great (I have a ton of experience with EF, LINQ, and relational DBs), BUT I am aware that it is most likely not the best solution, especially for storing "big" data (the data is still small for now, but will grow within a year).

I have chosen DocumentDB as our possible solution for storing just our event history. The rest (users, profiles, device info, SIMs, vehicles, etc.) will stay in SQL, as I don't want to completely halt development while we move 100% across to DocumentDB; I'd rather just do what's best short term for cost and performance.
Going through this video, I finally came up with a possible solution for how to store telemetry data: https://www.youtube.com/watch?v=-o_VGpJP-Q0 They recommended one document per time period (the example used one per hour). Is this still the recommended approach? (I've sketched what that might look like after the model below.) For reference, here is the event model:
using System;
using System.ComponentModel.DataAnnotations.Schema; // [Index] (EF 6.1+)
using System.Data.Entity.Spatial;                   // DbGeography

// Class name assumed; the properties are the event model as posted.
public class TelemetryEvent
{
    [Index]
    public DateTime TimestampUtc { get; set; }
    public DateTime ReceivedTimestampUtc { get; set; }

    [Index]
    public EventType EventType { get; set; }
    public Guid ConnectionId { get; set; }
    public string RawEventMessage { get; set; }

    [Index]
    public Sender Sender { get; set; }

    [Index]
    public Channel Channel { get; set; }

    // GPS fix
    public DbGeography Location { get; set; }
    public double? Speed { get; set; }
    public double? Altitude { get; set; }
    public Int16? Heading { get; set; }
    public Byte? HDOP { get; set; }
    public Byte? GPSFixStatus { get; set; }
    public Byte? GPSFixType { get; set; }

    // Device identity
    public string Serial { get; set; }
    public string HardwareVersion { get; set; }
    public string FirmwareVersion { get; set; }

    // I/O state
    public string Relay1 { get; set; }
    public string Relay2 { get; set; }
    public string Relay3 { get; set; }
    public string Ign { get; set; }
    public string Doors { get; set; }
    public string Input1 { get; set; }
    public string Input2 { get; set; }
    public string Out1 { get; set; }
    public string Out2 { get; set; }

    // Supply voltages
    public int V12 { get; set; }
    public int VBat { get; set; }
}
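
For comparison, the "one document per time period" pattern from the video would look roughly like this. This is only a sketch of my understanding of the idea; the class and property names are my own, not from the video:

using System;
using System.Collections.Generic;

// Sketch of "one document per device per hour" bucketing.
public class HourlyEventBucket
{
    public string Id { get; set; }        // e.g. "{DeviceId}-2016-07-14T09"
    public string DeviceId { get; set; }
    public DateTime HourUtc { get; set; } // start of the hour this bucket covers

    // Each reading in the hour is appended here, so every write is an
    // in-place update of a document that grows throughout the hour.
    public List<TelemetryEvent> Events { get; set; } = new List<TelemetryEvent>();
}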
That's one of several possible alternatives. Which is best depends on what your data looks like. For instance, if you have events that vary in their start date/time and duration (or end date/time), or if you track all state changes of entities, then something like Richard Snodgrass' temporal data model is ideal. Interestingly, Microsoft SQL Server 2016 recently added direct support for temporal tables, but they have been in the SQL spec as TSQL2 for a while. Note that the TSQL2 spec includes both valid-time and transaction-time support, but I believe the recent SQL Server 2016 addition only supports valid time... which is OK, since that's what is most valuable. I only point this out because getting your head around how a valid-time table works is hard enough without the added complexity of transaction time.
The beauty of this approach is that you don't have to decide on the time granularity up front, as the data is collected; you only decide if/when you aggregate it.
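To make the valid-time idea concrete, here is a minimal sketch of a Snodgrass-style snapshot (the class and property names are mine, not from SQL Server or any particular library): each state change closes the previous snapshot's interval and opens a new one, and the state at any instant is the snapshot whose interval contains that instant.

using System;

// Minimal valid-time snapshot, illustrating the Snodgrass-style model.
public class DeviceStateSnapshot
{
    public string DeviceId { get; set; }
    public string Ign { get; set; }           // e.g. one tracked state field
    public DateTime ValidFromUtc { get; set; }

    // Open-ended interval: MaxValue means "still the current state".
    // When a state change arrives, set this to the new reading's timestamp
    // and insert a fresh snapshot with the new values.
    public DateTime ValidToUtc { get; set; } = DateTime.MaxValue;

    // True if this snapshot was the device's state at the given instant.
    public bool WasValidAt(DateTime instantUtc) =>
        ValidFromUtc <= instantUtc && instantUtc < ValidToUtc;
}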
However, as you said, SQL is not ideal for such large data sets. So I've implemented a valid-time, Snodgrass-style temporal model on top of DocumentDB in my Lumenize library, in particular the TimeSeriesCalculator and its other time-series functionality. Read pages 10-19 here for a backgrounder on the data model and common operations in Lumenize's time-series analysis. That deck is for an implementation I did while at Rally, called the Lookback API, which was built on MongoDB, but the concepts are the same and I've since switched to DocumentDB (Rally hasn't).
Another comment on your proposed model: you might want to consider a separate document for every reading. It's a bit confusing from the example whether there is a document per minute or one per device. If it's one per device per hour, then you can be assured that you'll never go past 60 minutes of readings, which would be OK; but in just about every other way I can think of, you run the risk of a single document growing unbounded, which is a big no-no in DocumentDB (and in NoSQL data modeling generally). Also, as you say, even if it isn't unbounded, it would involve a lot of in-place updates. Since your system is likely to be write-heavy, I suggest you might be better off with a single document per reading (sketched below). If you later need to store denormalized aggregations for speed, you still have the option to do that. You may not even need them, though; let the performance of the production system inform that decision.
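To illustrate the document-per-reading approach, here is a rough sketch using the DocumentDB .NET SDK. The endpoint, key, and database/collection names are placeholders, and TelemetryEvent is the model from your question:

using System;
using System.Threading.Tasks;
using Microsoft.Azure.Documents.Client;

// Sketch: one small, immutable document per reading.
public static class EventWriter
{
    private static readonly DocumentClient Client = new DocumentClient(
        new Uri("https://your-account.documents.azure.com:443/"), "<auth-key>");

    public static Task WriteAsync(TelemetryEvent reading) =>
        Client.CreateDocumentAsync(
            UriFactory.CreateDocumentCollectionUri("telemetry", "events"),
            reading); // insert-only: no in-place updates, no growing documents
}

Because each reading becomes its own document, writes are pure inserts, which suits a write-heavy workload and avoids the unbounded-growth problem entirely.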
I also suggest that you read up on time dimensions for star schemas. The design looks a lot like what you are planning, and it's also ideal for the denormalized aggregation storage I describe above. I have not seen any write-ups of star-schema concepts for NoSQL, but here is one from the traditional SQL world that will help you with the concepts.
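As a sketch of what such a denormalized aggregate with star-schema-style time-dimension attributes might look like (the shape and names are my own, not from the linked write-up):

using System;

// Sketch of a denormalized hourly rollup document, in the spirit of a
// star-schema time dimension. Field names are illustrative only.
public class HourlyDeviceRollup
{
    public string Id { get; set; }     // e.g. "{DeviceId}-2016-07-14T09"
    public string DeviceId { get; set; }

    // Pre-split time-dimension attributes make grouping queries cheap.
    public int Year { get; set; }
    public int Month { get; set; }
    public int Day { get; set; }
    public int Hour { get; set; }

    // Pre-computed aggregates over the raw per-reading documents.
    public int ReadingCount { get; set; }
    public double? AvgSpeed { get; set; }
    public double? MaxSpeed { get; set; }
}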
As I said, there are a lot of alternatives, and without knowing more about your situation, I can't say which is best.