Tags: amazon-web-services, architecture, pipeline, aws-iot, aws-iot-analytics

How to design an AWS IoT Analytics Pipeline that will have separate data-set for each device?


I have a mobile application that fetches data from sensors and pushes it to an AWS IoT Core topic. I want to relay this data to AWS IoT Analytics and then analyze it with my own machine learning code, using container data sets. The important thing is that events are segregated and batched by device_id and analyzed in 30-minute time windows; in my case it only makes sense to analyze together a group of events generated by the same device_id. The event payload already contains a unique device_id property.

The first solution that comes to mind is to have a separate Channel -> Pipeline -> Datastore -> SQL Data Set -> Container Data Set setup for each of the mobile clients (diagram: AWS IoT Analytics architecture 1). Given N devices, the problem with this architecture is that I would need N channels, N pipelines which are actually identical, N data stores which all hold the same type/schema of data, and finally 2*N data sets. With 50,000 devices the number of resources is huge, which makes me realize this is not a good solution.

The next idea is to have only one Channel, one Pipeline and one Datastore for all devices, and only have a different SQL Data Set and a different Container Data Set per device (diagram: AWS IoT Analytics architecture; a minimal sketch of this shared setup is shown below). This architecture feels much better, but with 50,000 devices I would still need 100,000 data sets. The default AWS limit is 100 data sets per account. Of course I can request a limit increase, but if the default is 100 data sets, does it make sense to request an increase to 1,000 times that?

Is either of these two architectures how AWS IoT Analytics is supposed to be used, or am I missing something?
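As referenced above, here is a minimal boto3 sketch of the shared Channel -> Pipeline -> Datastore setup; the resource names are placeholders of my own choosing, and the IoT Core rule that routes messages into the channel is omitted:

    import boto3

    # AWS IoT Analytics client (region and credentials assumed to be configured)
    iota = boto3.client("iotanalytics")

    # One channel receives the messages of every device (routed here by an IoT Core rule).
    iota.create_channel(channelName="sensor_channel")

    # One datastore holds the processed messages of all devices.
    iota.create_datastore(datastoreName="sensor_datastore")

    # One pipeline connects the channel to the datastore; extra activities
    # (filters, attribute transforms) could be added in between if needed.
    iota.create_pipeline(
        pipelineName="sensor_pipeline",
        pipelineActivities=[
            {"channel": {"name": "source", "channelName": "sensor_channel", "next": "store"}},
            {"datastore": {"name": "store", "datastoreName": "sensor_datastore"}},
        ],
    )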


Solution

  • I posted the same question on the AWS Forum and got a helpful answer from an engineer who works there. I am posting his answer here for anyone with similar architecture requirements (a sketch of the setup he describes follows after the quote):

    I don't think a dataset per user is the right way to model this. The way we'd recommend the data architecture would be to use a single dataset (or maybe a small number of datasets pivoted by device type, country or other higher-level grouping) and have a SQL query that extracts data for the time period of interest, 30 minutes in your case. Next you trigger a container dataset that consumes the dataset and prepares the final analysis you need per user.

    The notebook would basically iterate over every unique customer id (you may be able to do grouping and ordering in the SQL to make this faster) and perform the analysis you need before sending that data where needed. You could have one container dataset to do the initial data processing per customer and a second container dataset to do the ML training, depending on the complexity of the scenario, but for many cases a single container dataset will be fine - I've used this approach to train tens of thousands of individual 'devices', so this may also work for your use case.
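Translating that recommendation into API calls, a hedged sketch might look like the following. The dataset names, cron schedule, container image and execution role ARN are placeholders I've assumed; the delta-time filter and the dataset-completion trigger are the IoT Analytics mechanisms for "last 30 minutes of data" and "run the container when the SQL dataset finishes":

    import boto3

    iota = boto3.client("iotanalytics")

    # One SQL dataset for ALL devices: every 30 minutes it materialises the most
    # recent window of messages, ordered so the container can walk device by device.
    iota.create_dataset(
        datasetName="sensor_window",
        actions=[{
            "actionName": "select_last_30_min",
            "queryAction": {
                "sqlQuery": "SELECT * FROM sensor_datastore ORDER BY device_id",
                "filters": [{
                    "deltaTime": {
                        # allow 60 seconds for late-arriving messages
                        "offsetSeconds": -60,
                        # assumes each message carries an epoch-seconds `timestamp` attribute
                        "timeExpression": "from_unixtime(timestamp)",
                    }
                }],
            },
        }],
        triggers=[{"schedule": {"expression": "cron(0/30 * * * ? *)"}}],
    )

    # One container dataset, triggered each time the SQL dataset finishes, which
    # runs the per-device analysis / ML code packaged in a Docker image.
    iota.create_dataset(
        datasetName="per_device_analysis",
        actions=[{
            "actionName": "analyse_per_device",
            "containerAction": {
                "image": "<account>.dkr.ecr.<region>.amazonaws.com/device-analysis:latest",
                "executionRoleArn": "arn:aws:iam::<account>:role/iot-analytics-container-role",
                "resourceConfiguration": {"computeType": "ACU_1", "volumeSizeInGB": 10},
            },
        }],
        triggers=[{"dataset": {"name": "sensor_window"}}],
    )

Inside the container, the "iterate over every unique id" step is ordinary dataframe work. Here I assume the dataset content has already been downloaded (for example via get_dataset_content and its presigned dataURI) to window.csv, and describe() is just a stand-in for the real per-device ML code:

    import pandas as pd

    # The 30-minute window produced by the SQL dataset above.
    df = pd.read_csv("window.csv")

    # One pass per device_id, as the answer describes.
    for device_id, events in df.groupby("device_id"):
        summary = events.describe()   # placeholder for the per-device analysis
        print(device_id, len(events))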