Tags: hadoop, hive, hbase, hdfs, azkaban

Relationship between HDFS, HBase, Pig, Hive and Azkaban?


I am somewhat new to Apache Hadoop. I have seen this and this question about Hadoop, HBase, Pig, Hive and HDFS. Both of them compare the technologies above.

But, I have seen that, typically a Hadoop environment contains all these components (HDFS, HBase, Pig, Hive, Azkaban).

Can someone explain the relationship between these components/technologies and their responsibilities inside a Hadoop environment, in an architectural-workflow manner, preferably with an example?


Solution

  • General Overview:

    HDFS is the Hadoop Distributed File System. Intuitively, you can think of it as a filesystem that spans many servers.
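    As a rough sketch, day-to-day interaction with HDFS looks like using an ordinary filesystem from the command line (the paths here are invented for illustration):

    ```shell
    # copy a local file into the distributed filesystem
    hdfs dfs -put access.log /logs/access.log

    # list a directory; the blocks behind these files may live on many servers
    hdfs dfs -ls /logs

    # read the file back
    hdfs dfs -cat /logs/access.log
    ```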

    HBase is a column-oriented datastore modeled after Google's Bigtable. If that doesn't mean anything to you, think of it as a non-relational database that provides real-time read/write access to data. It is integrated into Hadoop.
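    To get a feel for that real-time access, here is a minimal sketch using the HBase shell (the table and column names are made up):

    ```
    create 'users', 'info'                     # table with one column family
    put 'users', 'row1', 'info:name', 'alice'  # write a single cell
    get 'users', 'row1'                        # real-time read of one row
    scan 'users'                               # iterate over the whole table
    ```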

    Pig and Hive are ways of querying data in the Hadoop ecosystem. The main difference is that Hive's query language is closer to SQL, while Pig uses its own language, called Pig Latin.
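    To make the difference concrete, here is the same hypothetical question ("how many visits per user?") in both, assuming an access-log table/relation that exists only for this example:

    ```sql
    -- HiveQL: reads like SQL
    SELECT user_id, COUNT(*) AS visits
    FROM access_logs
    GROUP BY user_id;
    ```

    ```
    -- Pig Latin: a step-by-step data-flow language
    logs   = LOAD '/logs/access_logs' AS (user_id:chararray, url:chararray);
    grp    = GROUP logs BY user_id;
    counts = FOREACH grp GENERATE group AS user_id, COUNT(logs) AS visits;
    DUMP counts;
    ```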

    Azkaban is a prison, I mean, a batch workflow job scheduler. It is similar to Oozie in that you can chain MapReduce, Pig, Hive, bash, etc. into a single job flow.
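    Azkaban flows are defined in plain property files; a hypothetical two-job flow (both file names and scripts invented) might look like:

    ```
    # count_visits.job
    type=command
    command=hive -f count_visits.hql
    ```

    ```
    # report.job -- runs only after count_visits succeeds
    type=command
    command=bash generate_report.sh
    dependencies=count_visits
    ```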

    At the highest level, you can think of HDFS as your filesystem and HBase as the datastore. Pig and Hive are your means of querying the datastore, and Azkaban is your way of scheduling jobs.
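    Tying those roles together, a toy end-to-end workflow might look like this (all paths, table names, and scripts are invented):

    ```shell
    # 1. filesystem layer: land the raw data on HDFS
    hdfs dfs -put clicks.csv /raw/clicks/

    # 2. access layer: point a Hive table at it and query
    hive -e "CREATE EXTERNAL TABLE clicks (user_id STRING, url STRING)
             ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
             LOCATION '/raw/clicks/';"
    hive -e "SELECT user_id, COUNT(*) FROM clicks GROUP BY user_id;"

    # 3. scheduling layer: wrap step 2 in an Azkaban job so it runs on a
    #    schedule instead of being typed by hand (cron's role, below)
    ```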

    Stretched Example:

    Say you're familiar with Linux ext3 or ext4 as a filesystem, MySQL/PostgreSQL/MariaDB/etc. as a database, SQL to access the data, and cron to schedule jobs. (On Windows, swap ext3/ext4 for NTFS and cron for Task Scheduler.)

    HDFS takes the place of ext3 or ext4 (and is distributed), HBase takes the database role (and is non-relational!), Pig/Hive are a way to access the data, and Azkaban is a way to schedule jobs.

    NOTE: This is not an apples-to-apples comparison. It's just to demonstrate that the Hadoop components are an abstraction meant to give you a workflow you're probably already familiar with.

    I highly encourage you to look into the components further, as you'll have a good amount of fun. Hadoop has so many interchangeable components (YARN, Kafka, Oozie, Ambari, ZooKeeper, Sqoop, Spark, etc.) that you'll be asking yourself this question a lot.

    EDIT: The links you posted go into more detail about HBase and Hive/Pig, so I tried to give an intuitive picture of how they all fit together.