amazon-web-services amazon-s3 cassandra etl olap

How to understand an OLAP cube in S3 or Cassandra?

In this repository, its author mentions that we can stage OLAP cubes in Cassandra or S3:

Once the data is in Redshift, our chief goal is for the BI apps to be able to connect to Redshift cluster and do some analysis. The BI apps can either directly connect to the Redshift cluster or go through an intermediate stage where data is in the form of aggregations represented by OLAP cubes.

How is it possible? How would that work? Am I missing any essential concept? As I understand OLAP cubes are a special data structure that exists in OLAP databases. Does he maybe mean specific pre-calculated combinations of dimensions and facts stored in a OLTP-oriented database, like Cassandra?

Solution

Key features of OLAP are:

pivoting
slicing
dicing
drilling

And Redshift can do this.

It's architecture is aimed to solve OLAP and BI tasks. See amazon-redshift-developer-guide

Amazon Redshift is specifically designed for online analytic processing (OLAP) and business intelligence (BI) applications, which require complex queries against large datasets. Because it addresses very different requirements, the specialized data storage schema and query execution engine that Amazon Redshift uses are completely different from the PostgreSQL implementation. For example, where online transaction processing (OLTP) applications typically store data in rows, Amazon Redshift stores data in columns, using specialized data compression encodings for optimum memory usage and disk I/O. Some PostgreSQL features that are suited to smaller-scale OLTP processing, such as secondary indexes and efficient single-row data manipulation operations, have been omitted to improve performance.

But the line between terms is very smooth.

As Diana Shealy said:

Stop Abusing OLTP as OLAP

There’s a lot of confusion in the market between OLTP and OLAP, and due to the high price of commercial OLAPs, startups and budget-constrained developers have gone on to abuse an OLTP database as an OLAP database. The abuse falls into two categories:

An often multi-shard MySQL database with application layer scripting to perform historical event data analysis. Although this setup is extremely common, it is one of the least productive ways to approach analytics. MySQL is not optimized in any way for reading large ranges of data and its support for analytic functions is weak. As there are multiple alternatives, avoid this “inexpensive” solution because you’ll be paying the price in other places eventually.

Using PostgreSQL as an OLAP layer. This is a more legitimate choice than above for starting an analytics platform because of Postgres’s solid analytic User Defined Functions (UDFs). Also, thanks to its c-store extension, PostgreSQL can be turned into a columnar database, making it an affordable alternative to commercial OLAPs.

Finally, if you are considering moving from OLTPs abused as OLAPs to “real” OLAPs like Redshift, I encourage you to learn how to use Redshift’s COPY Command so that you can start seeing your data inside Redshift.

As for your questions:

How is it possible?

It's possible due to Redshift architecture (column database) and analytical features such as:

How would that work?

See System and Architecture Overview for a detailed explanation of the Amazon Redshift data warehouse system architecture.

(Some links are already mentioned before in this post)

Essential concept?

Am I missing any essential concept?

I'd suggest more rely on technical details of specific solution instead of marketing terms. In the end, practical tasks are not solved by software naming or marketing, but with it's real functionality.

What's really important in DB landscape - is to consider two theorems:

CAP theorem

According to Iron triangle of CAP theorem, you can choose two points of three DB architecture components: * consistency * availability * persistence

PIE theorem

Rick Houlihan of Amazon had a speech on choosing the DB archotecture. In addition to the CAP theorem, he also presented PIE theorem:

The PIE theorem posits that you can choose two out of three desirable features in a data system:

Pattern Flexibility
Efficiency
Infinite Scale

And Redshift is on PI dimension of the PIE triangle

Data structure

As I understand OLAP cubes are an special data structure that exists in OLAP databases. Does he maybe mean specific pre-calculated combinations of dimensions and facts stored in a OLTP-oriented database, like Cassandra?

Both OLAP aggregated data structures and Redshift distribution styles aimed one goal: make queries faster. Column DB, distribution, parallel queries and other features are good for analytical tasks.

UPD

In comments you asked if Cassandra can work as OLAP service. Cassandra and S3 can be used as a storage for pre-calculated aggregated data of dimensions.