Search code examples
bigdatadata-modelingdata-warehousedata-lake

What is a Data Warehouse and can it be applied to complex data?


I want to define data warehouse with the necessary literature reference.

I found on wikipedia that wiki

DWs are central repositories of integrated data from one or more disparate sources. They store current and historical data in one single place that are used for creating analytical reports for workers throughout the enterprise.

does that imply that it is always a relational data base underneath an data warehouse or can it be any kind of repository?

In An Architecture Framework for Complex Data Warehouses the term data warehouse is also used for complex data which means video, images etc. but the term data warehouse remains undefined in that paper.


Solution

  • A "Data warehouse" is mostly an information systems concept that describes a centralized and trusted source of (e.g. company/business) data.

    From Wikipedia: "DWs are central repositories of integrated data from one or more disparate sources. They store current and historical data in one single place that are used for creating analytical reports for workers throughout the enterprise."

    I consider the Kimball Group one of the most authoritative sources on the topic, as they have been developing their framework and methodologies for over two decades, and they've also been applying that framework to different business and technical areas and sharing the outcomes and results of this.

    Kimball's The Data Warehouse Toolkit is one of the reference books on the topic, and it defines a data warehouse as "a copy of transaction data specifically structured for query and analysis".

    Bill Inmon is also considered one of the pioneers of data warehousing, and defines a data warehouse as "a subject-oriented, integrated, time-variant and non-volatile collection of data in support of management's decision making process"

    A data warehouse does not have to be implemented on a relational database system, although it is very common to implement Kimball's dimensional models in RDBMS or different database systems that support the concepts of "joinable" tables (e.g. Redshift, Presto, Hive).

    A recent addition to data architectures, which perfectly accomodates complex data types, is the concept of a data lake, which is usually a data store that can handle virtually any kind of data types (e.g. S3, HDFS) which can either be analyzed directly (e.g. MapReduce over XML files on S3) or processed into different formats or data models (like a dimensional model).

    Edit following your comment:

    A Data Warehouse and a Data Lake are independent systems that serve different purposes, can/should be complementary, and both are part of a larger data architecture. A data lake, as a concept, can be just another data source for dimensional models on a data warehouse (although the technological implementation of data lakes enables direct querying over the raw data).

    You can think of a Data Lake as a "landing zone" where several systems dump data in a "complex/raw format", e.g. MP3 files from customer support calls, gzipped logs from web servers. It's meant to sit there for historical purposes and for further processing into a format that can be easily analyzed/reported over, e.g. text extraction from MP3 files.

    A Data Warehouse also aggregates data from different systems, but the data is modeled into a format appropriate for reporting (like a dimensional model), its model reflects the business/domain's processes and transactions, and is usually highly curated.

    Imagine the case: if you log visits to your online store using web server logs, you could keep the gzipped logs (the "transaction data") in a data lake and then process the data into a dimensional model (like this) which will be the "copy of transaction data specifically structured for query and analysis", so business users can easily explore it in Excel or some other reporting tool.