hadoop, google-cloud-datastore, google-cloud-dataproc

GCP Hadoop data warehouse?


I know Google BigQuery is a data warehouse, but are Dataproc, Bigtable, and Pub/Sub also considered data warehouses? Would that make Hadoop a data warehouse?


Solution

  • A "data warehouse" is primarily an information-systems concept: a centralized, trusted source of data (e.g. for a company or business).

    From Wikipedia: "DWs are central repositories of integrated data from one or more disparate sources. They store current and historical data in one single place that are used for creating analytical reports for workers throughout the enterprise."

    Regarding your questions, a simple answer would be:

    • Google BigQuery is a managed analytics service that pairs its own storage with a SQL query engine; it can also query external data stores of different kinds, and it is commonly used as a data warehouse.
    • Google Bigtable is a NoSQL database service that can be used to implement a data warehouse or any other data store.
    • Google Dataproc is a managed data processing service composed of common Hadoop processing components such as MapReduce (or Spark, if you consider it part of Hadoop).
    • Hadoop is a framework/platform for data storage and processing composed of different components (e.g. data storage via HDFS, data processing via MapReduce). You could use a Hadoop platform to build a data warehouse, e.g. by using MapReduce to process data and load it into ORC files that are stored in HDFS and can be queried by Hive. But it would only be appropriate to call it a data warehouse if it is a "centralized, single version of the truth about data" ;)
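
As a rough illustration of the Hadoop point, the Hive side of such a setup could look something like the sketch below. It is a minimal Python sketch under stated assumptions, not a definitive implementation: it only builds the HiveQL statement that would expose ORC files already written to HDFS (for example by a MapReduce job) as a queryable table. The helper `build_orc_table_ddl`, the table name `warehouse.sales`, its columns, and the path `hdfs:///warehouse/sales` are all hypothetical.

```python
def build_orc_table_ddl(table, columns, location):
    """Return HiveQL that exposes ORC files at `location` as an external table.

    `columns` is a list of (name, hive_type) pairs. All names passed in here
    are illustrative; nothing is executed against a real cluster.
    """
    if not columns:
        raise ValueError("at least one column is required")
    # One "name TYPE" definition per line, comma-separated.
    column_defs = ",\n".join(
        f"  {name} {hive_type}" for name, hive_type in columns
    )
    return (
        f"CREATE EXTERNAL TABLE IF NOT EXISTS {table} (\n"
        f"{column_defs}\n"
        f")\n"
        f"STORED AS ORC\n"
        f"LOCATION '{location}'"
    )


# Hypothetical example: a sales table over ORC files a batch job wrote to HDFS.
ddl = build_orc_table_ddl(
    "warehouse.sales",
    [("order_id", "BIGINT"), ("amount", "DECIMAL(10,2)"), ("sold_at", "TIMESTAMP")],
    "hdfs:///warehouse/sales",
)
print(ddl)
```

The generated statement could then be run through a Hive client such as Beeline. Whether the result deserves to be called a data warehouse still depends on whether it serves as the centralized, trusted source of data, not on the tooling.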