Quite long post so let’s start with some context: Weather data have a central role in our architecture. A weather data is mainly composed of five values:
But we could also have some more custom values.
Our specificities are:
Missing values: All this five values are not always available from one weather station. Sometimes we need to take missing values from the nearest weather stations (ex: global radiation)
Sampling rate: For one given weather station, sampling rates can be different between the five values.
Virtual stations: We also have special “virtual” weather stations, that are composed of separated weather sensors (from real weather stations).
In all cases, at the end of the acquisition process, for each event in a weather station (real or virtual) we need to calculate some higher level indices from this five values. Some of this five values or higher level indices are aggregated daily.
We plan to use Spark for data processing.
Which of these three models is the most relevant and that would not deprive us of Cassandra benefits?
How to manage relations between sensors and weather_stations (missing data and virtual stations)?
CREATE TABLE sensor_data {
sensor_id uuid
day text,
timestamp timestamp,
sensor_type text,
value double,
weather_station_id
PRIMARY KEY ((sensor_id, day), timestamp)
}
CREATE TABLE weather_data {
weather_station_id uuid,
day date,
timestamp timestamp,
sensor_data seq<sensor_data>
PRIMARY KEY ((weather_station_id, day), timestamp)
}
CREATE TABLE weather_temperature {
sensor_id uuid,
day text,
timestamp timestamp,
value double,
weather_station_id
PRIMARY KEY ((sensor_id, day), timestamp)
}
CREATE TABLE weather_rain {
...
}
Same for all kind of measurements. Then from these tables we need to process data to aggregate everything, filling missing values and repeating values with lower sampling rate.
CREATE TABLE weather_data (
weather_station_id uuid,
day date,
timestamp timestamp,
...
PRIMARY KEY ((weather_station_id, day), timestamp)
);
CREATE TABLE weather_data (
weather_station_id text,
day date,
timestamp timestamp,
temperature float,
rain float,
global_radiation float,
relative_humidity float,
wind_speed float,
wind_direction float,
PRIMARY KEY ((weather_station_id, day), timestamp)
);
Then we fill a weather_data_processed table with virtual station, missing values and repeat values with lower sampling rate.
To answer this question it's important to understand how you plan to query the data with cql. Can you provide a bit more information about that?
To solve the problem of 'virtual weather stations' you could go with something like this:
CREATE TABLE weather_data_station_sensor (
weather_station_id text,
sensor_id text,
day date,
timestamp timestamp,
temperature float,
rain float,
global_radiation float,
relative_humidity float,
wind_speed float,
wind_direction float,
PRIMARY KEY ((weather_station_id, day), sensor_id, timestamp)
);
And use the same table for real and virtual stations. When you get a reading from a sensor that belongs in both a real station and a virtual station you do two (or more updates) using a BATCH. For example:
BEGIN BATCH
INSERT INTO weather_data_station_sensor (weather_station_id, sensor_id ...etc) VALUES ('station_1', 'id_1' ... etc);
INSERT INTO weather_data_station_sensor (weather_station_id, sensor_id ...etc) VALUES ('station_2', 'id_1' ... etc);
APPLY BATCH