I have an application that collects data from about a dozen sensors in a Smart House. It stores its data in a MySQL database with the following table format:
CREATE TABLE IF NOT EXISTS `datapoints` (
`PointID` int(11) NOT NULL,
`System` varchar(50) NOT NULL,
`Sensor` varchar(50) NOT NULL,
`Value` varchar(50) NOT NULL,
`Timestamp` timestamp NOT NULL DEFAULT CURRENT_TIMESTAMP ON UPDATE CURRENT_TIMESTAMP,
PRIMARY KEY (`PointID`)
) ENGINE=MyISAM DEFAULT CHARSET=utf8;
The System field groups sensors: for example, the "Air" system has "Temperature" and "Humidity" sensors, and the "Solar Panel" system has "Power production (kW)" and "Production Today (kWh)" sensors. The fields are all varchar because several datatypes come in from the sensors and the original database designer took the easy way out. (I know this data structure isn't very efficient, but it's too late to change it.)
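To show what the varchar column means in practice: any numeric aggregation has to cast Value first. A minimal MySQL sketch against the schema above (the DECIMAL precision and the 'Air'/'Temperature' filter are illustrative assumptions):

```sql
-- Hourly average air temperature. CAST is required because `Value` is a varchar;
-- DECIMAL(10,4) is an assumed precision -- adjust to your sensors' actual ranges.
SELECT DATE_FORMAT(`Timestamp`, '%Y-%m-%d %H:00:00') AS `Hour`,
       AVG(CAST(`Value` AS DECIMAL(10,4)))           AS `AvgValue`
FROM `datapoints`
WHERE `System` = 'Air' AND `Sensor` = 'Temperature'
GROUP BY `Hour`;
```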
The sensors include air temperature, humidity, solar panel output, solar water-heater temperature, and others.
We now have this database, which collects tens of thousands of data points every day. Until now, we have used an application that queries the database to build graphs of sensor data over time. However, we now have many gigabytes of data and will eventually run out of storage space on the logging hardware.
I am looking for a way to collect statistics from the data and then delete the raw points. I am thinking of something a lot like Google Analytics, Piwik, or Awstats, but for sensor data. The problem is that I don't really know where to start. I want to keep more detailed data from more recent times and only coarser summaries of older data.
However, I don't want to smooth the data too much. Eventually I will have to smooth it, but I want to keep it detailed for as long as possible. For example, suppose there is a big spike in power production: if the spike is averaged (and flattened) into the hourly data, then again into the daily data, and then again into the weekly data, the weekly data will no longer show that there was a spike at all, because an average of averages is not the same as the average of all the points.
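One common way around the average-of-averages problem is to store, at each roll-up level, not just the average but also the minimum, maximum, and point count: the max preserves the spike through every level, and the counts let the next level compute an exact weighted average. A hedged sketch for the schema above (the roll-up table and its column names are my invention, and it assumes the values being rolled up are numeric):

```sql
-- Hypothetical roll-up table: one row per system/sensor/hour.
CREATE TABLE IF NOT EXISTS `datapoints_hourly` (
  `System`    varchar(50) NOT NULL,
  `Sensor`    varchar(50) NOT NULL,
  `Hour`      datetime    NOT NULL,
  `AvgValue`  double      NOT NULL,
  `MinValue`  double      NOT NULL,  -- preserves dips through later roll-ups
  `MaxValue`  double      NOT NULL,  -- preserves spikes through later roll-ups
  `NumPoints` int         NOT NULL,  -- weight for exact averages at the next level
  PRIMARY KEY (`System`, `Sensor`, `Hour`)
);

-- Roll the raw points up into hourly rows (filter out non-numeric sensors first).
INSERT INTO `datapoints_hourly`
SELECT `System`, `Sensor`,
       DATE_FORMAT(`Timestamp`, '%Y-%m-%d %H:00:00') AS `Hour`,
       AVG(CAST(`Value` AS DECIMAL(10,4))),
       MIN(CAST(`Value` AS DECIMAL(10,4))),
       MAX(CAST(`Value` AS DECIMAL(10,4))),
       COUNT(*)
FROM `datapoints`
GROUP BY `System`, `Sensor`, `Hour`;
```

When rolling hourly rows up to daily ones, SUM(AvgValue * NumPoints) / SUM(NumPoints) gives the exact daily average, so nothing is lost to the average-of-averages effect, and MAX(MaxValue) carries the spike all the way up.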
Is there any software that already does this? If not, what is a good way to start? I can do it in any language, but preference goes to .NET, PHP, or C (for Windows), in that order, because those are the languages the other project members already know and the ones the logging hardware is already set up for.
If I were still doing this project today (as I would for other projects of this type), I would use a Time Series Database (TSDB).
A TSDB is specifically designed for ingesting large volumes of data points over time and allowing analysis of them.
I have since been playing with the TimescaleDB extension for PostgreSQL on another project, and it would have done exactly what I needed.
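For reference, the basic Timescale workflow looks roughly like this (a sketch only: it assumes a PostgreSQL table similar to the one above, with lowercase column names and a numeric value column; create_hypertable and time_bucket are real TimescaleDB functions, but check the current docs for exact signatures):

```sql
-- Turn an ordinary PostgreSQL table into a hypertable partitioned on time.
SELECT create_hypertable('datapoints', 'ts');

-- Hourly aggregates on demand, keeping min/max/count so spikes survive roll-up.
SELECT time_bucket('1 hour', ts) AS bucket,
       system, sensor,
       avg(value), min(value), max(value), count(*)
FROM datapoints
GROUP BY bucket, system, sensor
ORDER BY bucket;
```

Timescale also has continuous aggregates and data-retention policies (e.g. drop_chunks) that automate exactly the "summarize, then delete the raw data" workflow asked about in the question.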