I'm new to PipelineDB and have yet to even experience it at runtime (installation pending ...). But I'm reading over the documentation and I'm totally intrigued.
Apparently, PipelineDB is able to take a set-based query and mechanically transform it into an incremental form that efficiently processes streams of deltas, with storage bounded by the size of the continuous view's output rather than by the size of the input.
Is it also possible to run the query in its ordinary set-based form to prime a continuous view? It seems to me that, when a Continuous View is created, its initial data would be computed this traditional way. Also, since Continuous Views can be truncated, can they then be repopulated (from still-available source tables) without tearing down their dependent objects to allow a drop/create?
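To make concrete what I mean by set-based versus incremental, here is a toy Python sketch. It is not PipelineDB's implementation; the names `batch_avg` and `IncrementalAvg` are made up for illustration, and the per-group (count, total) state is only my assumption about how an average could be maintained incrementally.

```python
from collections import defaultdict


def batch_avg(rows):
    """Set-based form: scan every stored row, group by key, then average."""
    groups = defaultdict(list)
    for key, value in rows:
        groups[key].append(value)
    return {k: sum(v) / len(v) for k, v in groups.items()}


class IncrementalAvg:
    """Incremental form: keep only [count, total] per group.

    Each incoming event updates the state and can then be discarded,
    so storage grows with the number of groups, not the number of events.
    """

    def __init__(self):
        self.state = defaultdict(lambda: [0, 0])  # key -> [count, total]

    def apply(self, key, value):
        entry = self.state[key]
        entry[0] += 1
        entry[1] += value

    def result(self):
        return {k: total / count for k, (count, total) in self.state.items()}


# Hypothetical event stream of (group key, value) pairs.
events = [("a", 10), ("b", 4), ("a", 20), ("b", 6), ("a", 30)]

view = IncrementalAvg()
for key, value in events:
    view.apply(key, value)
```

Both forms give the same answer for these events, but only `batch_avg` needs the raw rows to still exist, which is exactly why I'm asking about priming and repopulating.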
It seems to me that this feature would be critical in many practical scenarios. One easy example would be refreshing occasionally to reset the drift from rounding errors in, say, fractional averages.
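Here is a rough Python sketch of the kind of drift I have in mind. It assumes the average is maintained as a running floating-point mean, which may not be how PipelineDB does it (it could just as well keep a sum and a count); `running_mean` and `recomputed_mean` are hypothetical helper names.

```python
import math


def running_mean(values):
    """Update the mean one value at a time, as an incremental view might."""
    mean = 0.0
    for n, x in enumerate(values, start=1):
        mean += (x - mean) / n
    return mean


def recomputed_mean(values):
    """Recompute from the full source data with an accurately rounded sum."""
    return math.fsum(values) / len(values)


# Fractional inputs that are not exactly representable in binary floating point.
values = [0.1 * (i % 7) for i in range(100_000)]

incremental = running_mean(values)
exact = recomputed_mean(values)
drift = abs(incremental - exact)  # any accumulated rounding difference
```

Any difference here is tiny, but it can only be reset by recomputing from the source data, which is the refresh I'm asking about.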
Another example is if a bug were discovered and fixed in PipelineDB itself that had caused errors in the data. After the software is patched, the queries based on data still available ought to be rerun.
Continuous Views based entirely on event streams with no permanent storage could not be rebuilt in that way. I'm not sure what would happen if only some of the join sources were ephemeral.
I don't see these topics covered in the docs. Can you explain how these are or aren't a concern?
Thanks!
Jeff from PipelineDB here.
The main answer to your question is covered in the introduction section of the PipelineDB technical docs:
"PipelineDB can dramatically reduce the amount of information that needs to be persisted to disk because only the output of continuous queries is stored. Raw data is discarded once it has been read by the continuous queries that need to read it."
While continuous views only store the output of continuous queries, almost everybody who is using PipelineDB is storing their raw data somewhere cheap like S3. PipelineDB is meant to be the realtime analytics layer that powers things like realtime reporting applications and realtime monitoring & alerting systems, used almost always in conjunction with other systems for data infrastructure.
If you're interested in PipelineDB you might also want to check out Stride, the new realtime analytics API product we recently rolled out. The Stride API gives developers the benefit of continuous SQL queries, integrated storage, windowed queries, and other things like realtime webhooks through a simple HTTP API, without having to manage any underlying data infrastructure.
If you have any additional technical questions you can always find our open-source users and dev team hanging out in our gitter chat channel.