Tags: database, hive, rdbms, data-warehouse, hadoop2

Can we store multiple types of data in a data warehouse?


Can we store various types of data in a Hadoop data warehouse, for example RDBMS tables, JSON documents, Cassandra keyspaces, text files, CSV, and so on? Is all of it stored in HDFS?


Solution

  • A classic DWH is a repository for structured, filtered data that has already been processed for a specific purpose. All of its data is stored in the same format, except in the landing zone (LZ or RAW), where data can be kept in the format in which it was loaded from the source systems. The DWH build process is typically based on the Kimball or Inmon methodology.

    What you are asking about is a Data Lake (DL), a more modern concept: a vast pool of raw data whose purpose may not be fully defined yet. In a DL you can store structured data along with semi-structured data, and data analysts can access both the RAW semi-structured data and structured data in 3NF or dimensional form.

    An RDBMS normally adds an abstraction layer between its internal storage representation and the way the data is accessed. However, many RDBMSs can also store data in external files in HDFS, which is used for convenient integration with a Data Lake.

    Yes, you can store everything in the same DL: semi-structured data and data in different storage formats such as Avro, CSV, Parquet, ORC, etc. You can build Hive tables on it as well as tables for different RDBMSs, and all of it can be stored in the same HDFS/S3/Azure/GCS/etc.

    Layers can also be created in a DL, such as RAW/LZ/DM, or layers based on a domain event/business event model. A DL therefore does not mean an absence of architecture constraints: normally you have an architecture design and constraints to follow in a DL, just as in a classic DWH.
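
To make the layering idea more concrete, here is a minimal Python sketch. It uses a local temporary directory as a stand-in for HDFS/S3, and the layer names (`raw`, `dm`), dataset names, file names, and helper functions are all hypothetical, chosen only for illustration. It lands a CSV export and a JSON-lines feed unchanged in a raw layer, then builds one structured table from them in a data mart layer.

```python
import csv
import json
import tempfile
from pathlib import Path

# Hypothetical layer names, loosely following the RAW/LZ/DM idea above.
LAYERS = ("raw", "dm")


def layer_path(root: Path, layer: str, dataset: str) -> Path:
    """Return (and create) the directory for a dataset inside a layer."""
    if layer not in LAYERS:
        raise ValueError(f"unknown layer: {layer}")
    path = root / layer / dataset
    path.mkdir(parents=True, exist_ok=True)
    return path


def land_raw(root: Path) -> None:
    """Store source data as-is: one CSV export and one JSON-lines feed."""
    orders = layer_path(root, "raw", "orders_csv") / "part-0000.csv"
    with orders.open("w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["order_id", "amount"])
        writer.writerows([[1, "10.50"], [2, "4.25"]])

    events = layer_path(root, "raw", "events_json") / "part-0000.jsonl"
    with events.open("w") as f:
        for event in ({"order_id": 1, "status": "paid"},
                      {"order_id": 2, "status": "refunded"}):
            f.write(json.dumps(event) + "\n")


def build_mart(root: Path) -> list:
    """Join the two raw datasets into one structured table in the DM layer."""
    with (root / "raw" / "orders_csv" / "part-0000.csv").open(newline="") as f:
        amounts = {int(r["order_id"]): float(r["amount"])
                   for r in csv.DictReader(f)}
    with (root / "raw" / "events_json" / "part-0000.jsonl").open() as f:
        events = [json.loads(line) for line in f if line.strip()]

    rows = [{"order_id": e["order_id"], "status": e["status"],
             "amount": amounts[e["order_id"]]} for e in events]
    out = layer_path(root, "dm", "order_status") / "part-0000.csv"
    with out.open("w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=["order_id", "status", "amount"])
        writer.writeheader()
        writer.writerows(rows)
    return rows


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        lake = Path(tmp) / "lake"
        land_raw(lake)
        print(build_mart(lake))
```

In a real lake the raw files would more likely be exposed through Hive external tables or a similar engine rather than read by hand, and the mart would usually be written in a columnar format such as Parquet or ORC. The sketch only shows that mixed formats and a layered layout can coexist under one storage root.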