database hadoop2 bigdata nosql

Advantages of Hadoop in combination with any database


There are so many different databases.

  • relational databases
  • nosql databases
    • key/value
    • document store
    • wide columns store
    • graph databases

And database technologies

  • in-memory

  • column oriented

All have their advantages and disadvantages. For me it is very difficult to understand how to evaluate or choose a suitable database for a big data project.

I am thinking about Hadoop, which has many functions to save data in HDFS or access different databases for analytics.

Is it right to say that Hadoop can make it easier to choose the right database, because it can be used first as a data store? So if I have Hadoop HDFS as my main data storage, can I still change the database for my application afterwards, or use multiple databases?


Solution

  • First and foremost, Hadoop is not a database. It is a framework for distributed storage and processing, and its storage layer, HDFS, is a distributed filesystem.

    For me it is very difficult to understand how to evaluate or choose a suitable database for a big data project.

    The choice of database for a project depends on these factors:

    1. Nature of the data storage and retrieval

      If it is meant for transactions, it is highly recommended that you stick to an ACID database.

      If it is to be used for web applications or random access, then you have a wide variety of choices, from the traditional SQL databases to newer database technologies that use HDFS as their storage layer, like HBase. Traditional SQL databases are well suited to random access because they have strong support for constraints and indexes.

      If analytical batch processing is the concern, the choice can be made among all the available options, based on the structural complexity and volume of the data.

    2. Data Format or Structure

      Most SQL databases support structured data (data that can be formatted into tables); some extend their support beyond that to store JSON and similar formats.

      If the data is unstructured, especially flat files, storing and processing it can be done easily with big data technologies like Hadoop, Spark, or Storm. Again, these technologies only come into the picture if the volume is high.

      Different database technologies work well with different data formats. For example, graph databases are well suited to storing structures that represent relationships or graphs.

    3. Size

      This is the next big concern: the more data you have, the greater the need for scalability. So it is better to choose a technology that supports a scale-out (horizontal) architecture, such as Hadoop or NoSQL stores, than one that only scales up (vertically). A scale-up design could become a bottleneck in the future when you plan to store more.
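As a rough illustration of what scale-out means for the size factor, the sketch below spreads records across nodes by hashing their keys. It is a toy model and not how Hadoop or any particular NoSQL store places data; the names `node_for_key` and `distribute` and the `user-N` keys are hypothetical.

```python
import hashlib
from collections import Counter


def node_for_key(key: str, num_nodes: int) -> int:
    """Map a record key to a node index using a stable hash."""
    digest = hashlib.md5(key.encode("utf-8")).hexdigest()
    return int(digest, 16) % num_nodes


def distribute(keys, num_nodes):
    """Return the number of records that land on each node."""
    counts = Counter(node_for_key(k, num_nodes) for k in keys)
    return [counts.get(n, 0) for n in range(num_nodes)]


# Hypothetical workload: 10,000 record keys.
keys = [f"user-{i}" for i in range(10_000)]

# Adding nodes lowers the number of records each node has to hold.
for n in (2, 4, 8):
    load = distribute(keys, n)
    print(n, "nodes -> max records on one node:", max(load))
```

With a scale-up design the whole workload stays on one machine, so that machine's capacity is the limit. In this toy model, adding nodes shrinks the share each node holds, although a plain modulo scheme like this one moves many keys when the node count changes.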

    I am thinking about Hadoop, which has many functions to save data in HDFS or access different databases for analytics.

    Yes, you can use HDFS as your storage layer and use any database or engine that supports HDFS to do the processing. (The choice of processing framework, from batch to near real time to real time, is a separate concern.) Note that traditional relational databases generally do not use HDFS as their storage. Among NoSQL databases, HBase stores its data on HDFS; others, like MongoDB, integrate with Hadoop through a connector rather than storing their own data on HDFS.

    If I have Hadoop HDFS as my main data storage, can I still change the database for my application afterwards, or use multiple databases?

    This could be tricky, depending on which database you want to pair with it afterwards.
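One pattern that may keep this option open is to store the raw data in an open, line-oriented format and treat each database as a view rebuilt from it. The sketch below is a minimal illustration under several assumptions: a local temporary directory stands in for an HDFS path, SQLite stands in for a relational database, a plain dict stands in for a key/value store, and the event records and function names are hypothetical.

```python
import json
import sqlite3
import tempfile
from pathlib import Path

# Hypothetical raw events; in a real setup these files would live on HDFS.
RAW_EVENTS = [
    {"user": "alice", "action": "login", "ts": 1},
    {"user": "bob", "action": "login", "ts": 2},
    {"user": "alice", "action": "purchase", "ts": 3},
]


def write_raw(path, events):
    """Write events as JSON Lines, one record per line."""
    with path.open("w", encoding="utf-8") as f:
        for event in events:
            f.write(json.dumps(event) + "\n")


def read_raw(path):
    """Read the JSON Lines file back into a list of dicts."""
    with path.open(encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]


def load_relational(events):
    """Load the events into a relational table (SQLite as a stand-in)."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE events (user TEXT, action TEXT, ts INTEGER)")
    conn.executemany("INSERT INTO events VALUES (:user, :action, :ts)", events)
    conn.commit()
    return conn


def load_key_value(events):
    """Build a key/value view: the latest event per user (dict as a stand-in)."""
    latest = {}
    for event in sorted(events, key=lambda e: e["ts"]):
        latest[event["user"]] = event
    return latest


with tempfile.TemporaryDirectory() as tmp:
    raw_path = Path(tmp) / "events.jsonl"
    write_raw(raw_path, RAW_EVENTS)
    events = read_raw(raw_path)

# The same raw data feeds two different stores.
conn = load_relational(events)
kv = load_key_value(events)
print(conn.execute("SELECT COUNT(*) FROM events WHERE user = 'alice'").fetchone()[0])
print(kv["alice"]["action"])
```

Because both stores are derived from the same raw files, replacing one of them means writing a new loader rather than migrating data out of the old database. How practical this is in a real cluster depends on data volume and on what import tooling the target database offers.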