I am new to PipelineDB and stream analytics.
I have these two SQL commands:
CREATE CONTINUOUS VIEW timing_hashtags WITH (sw = '1 minutes')
AS SELECT h, minute(arrival_timestamp) as minuteOfArrival, COUNT(*) as quantity
FROM hashtag_stream GROUP BY h, minuteOfArrival;
and
CREATE CONTINUOUS VIEW timing_hashtagsTTL WITH (ttl = '1 minute', ttl_column = 'minuteOfArrival')
AS SELECT h, minute(arrival_timestamp) as minuteOfArrival, COUNT(*) as quantity
FROM hashtag_stream GROUP BY h, minuteOfArrival;
When I run the following query on both continuous views:
SELECT * FROM timing_hashtags ORDER BY minuteOfArrival DESC;
the result is the same for both the timing_hashtags and timing_hashtagsTTL continuous views.
Can somebody please help me understand the difference between the "ttl" and "sw" options on continuous views?
Thank you.
I'll define each one here to hopefully clarify the distinction between the two:
TTL - A hint to the reaper that any rows older than this can be deleted in the background. TTL-expired rows may thus still be visible at read time if the reaper hasn't gotten around to removing them.
Sliding window - Only consider data within this window at read time. Data outside of the specified window will never be visible at read time. Sliding windows use TTLs internally to expire old rows, and they also do a final aggregation over the sliding window at read time.
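To make the read-time filtering concrete, here is a rough sketch of what the sw option expands to. As far as I recall from the PipelineDB docs, sw is shorthand for a WHERE clause on arrival_timestamp, so treat the exact equivalence as an assumption. The view name hashtags_last_minute is made up for illustration.

```sql
-- Hypothetical view name. Assumed to behave like WITH (sw = '1 minute'):
-- only rows that arrived within the last minute are counted at read time.
CREATE CONTINUOUS VIEW hashtags_last_minute AS
SELECT h, COUNT(*) AS quantity
FROM hashtag_stream
WHERE (arrival_timestamp > clock_timestamp() - interval '1 minute')
GROUP BY h;
```

A plain TTL view has no such predicate, so whatever rows the reaper has not yet deleted are returned as-is.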
CREATE CONTINUOUS VIEW timing_hashtags WITH (sw = '1 minutes')
AS SELECT h, minute(arrival_timestamp) as minuteOfArrival, COUNT(*) as quantity
FROM hashtag_stream GROUP BY h, minuteOfArrival;
Whenever a sliding window CV already groups by a timestamp column, it's almost always better to just use a TTL with no sliding window. The reason is that the SW CV performs a final aggregation over the last 1 minute at read time, which is likely unnecessary since each row is already aggregated at a minute level. Internally, the SW CV also stores its data at a finer granularity than 1 minute (e.g. there would be many rows per minute) and combines those rows over the last minute at read time.
Removing the minuteOfArrival from this CV definition may make the SW semantics clearer to you. Even without minuteOfArrival, you'd only ever see data for the last minute when reading from the CV.
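For example, a minimal sketch of that simplified definition could look like the following. It reuses your stream and column names; the view name timing_hashtags_sw is only a placeholder I picked.

```sql
-- Sliding-window CV without the minute column (placeholder view name).
CREATE CONTINUOUS VIEW timing_hashtags_sw WITH (sw = '1 minute')
AS SELECT h, COUNT(*) AS quantity
FROM hashtag_stream GROUP BY h;

-- Each read should reflect only events from the last minute,
-- so a hashtag's count drops again once its events age out of the window.
SELECT * FROM timing_hashtags_sw ORDER BY quantity DESC;
```

If you stop writing to hashtag_stream and rerun the SELECT after a minute or so, this view should come back empty, whereas the TTL view may still show rows until the reaper removes them.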