
Hadoop with MongoDB storage


I have a project that uses a NoSQL database with Hadoop and benchmarks it. I chose MongoDB as the database, but I am confused about a few things and have some questions that need to be clarified:

  1. Will MongoDB replace HDFS, or will they work together, and if so, how?

  2. Is benchmarking MongoDB alone different from benchmarking it with Hadoop? I feel like they are the same thing.

  3. I found the YCSB tool for benchmarking. Can it benchmark them together?

  4. I know that MongoDB can run on a cluster. When Mongo runs on top of Hadoop, will the data be distributed across the nodes by MongoDB or by Hadoop?

I hope you can clarify these concepts. Thank you in advance.


Solution

  • Will MongoDB be replacing HDFS?

    Absolutely not. HDFS is not meant to be used as a database, and Mongo is not a distributed filesystem capable of storing petabytes of arbitrary data.

    Will they be working together, and how?

    Hive and Spark can read data from Mongo directly. I'm sure there are other tools that can back up Mongo into HDFS.
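As a rough illustration of Spark reading from Mongo and landing the data in HDFS, here is a minimal PySpark-style sketch. It assumes the MongoDB Spark Connector 10.x is on the Spark classpath and that its reader is selected with `format("mongodb")` and the `connection.uri`, `database`, and `collection` options; check those names against the connector version you install. The helper names, the URI, and the HDFS path are made up for the example.

```python
def mongo_read_options(uri, database, collection):
    """Build reader options (assumed MongoDB Spark Connector 10.x option names)."""
    return {
        "connection.uri": uri,
        "database": database,
        "collection": collection,
    }


def copy_collection_to_hdfs(spark, uri, database, collection, hdfs_path):
    """Read one MongoDB collection as a DataFrame and write it to HDFS as Parquet.

    `spark` is an existing SparkSession created by the caller.
    """
    reader = spark.read.format("mongodb")
    for name, value in mongo_read_options(uri, database, collection).items():
        reader = reader.option(name, value)
    df = reader.load()
    # Mongo stays the system of record; HDFS just receives a copy for batch jobs.
    df.write.mode("overwrite").parquet(hdfs_path)
    return df
```

With a real session this would be called as something like `copy_collection_to_hdfs(spark, "mongodb://localhost:27017", "bench", "usertable", "hdfs:///data/usertable")`. The point is that the two systems sit side by side: Mongo serves the records, and Hadoop tooling processes a copy of them.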

    Is benchmarking MongoDB alone different from doing it with Hadoop?

    Yes. Tuning reads and writes in Mongo involves vastly different parameters than tuning HDFS, because HDFS is not a database.

    YCSB tool for benchmarking

    It's not clear what you're benchmarking in Hadoop. Writing and reading a bunch of files (with and without MapReduce)? Seeing how many jobs run in YARN at a given time? Hadoop, again, isn't a database meant to store simple JSON blobs.
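To make concrete what a database benchmark of this kind measures, here is a small Python sketch of a YCSB-style workload: load some records, run a mix of reads and updates by key, and report throughput and per-operation latency. It is not YCSB itself. The `InMemoryStore` class is a hypothetical stand-in for a real database client, and the parameter names only loosely mirror YCSB's `recordcount`, `operationcount`, and `readproportion` settings.

```python
import random
import time


class InMemoryStore:
    """Stand-in for a database client; swap in a real client with the same methods."""

    def __init__(self):
        self._rows = {}

    def insert(self, key, value):
        self._rows[key] = value

    def read(self, key):
        return self._rows.get(key)


def run_workload(store, record_count=1000, operation_count=5000,
                 read_proportion=0.95, seed=42):
    """Load records, run a read/update mix, and report throughput and latencies."""
    rng = random.Random(seed)
    for i in range(record_count):
        store.insert(f"user{i}", {"field0": "x" * 100})

    latencies = {"read": [], "update": []}
    started = time.perf_counter()
    for _ in range(operation_count):
        key = f"user{rng.randrange(record_count)}"
        op = "read" if rng.random() < read_proportion else "update"
        t0 = time.perf_counter()
        if op == "read":
            store.read(key)
        else:
            store.insert(key, {"field0": "y" * 100})
        latencies[op].append(time.perf_counter() - t0)
    elapsed = time.perf_counter() - started

    report = {"ops_per_sec": operation_count / elapsed}
    for op, samples in latencies.items():
        if samples:
            samples.sort()
            report[f"{op}_count"] = len(samples)
            # Approximate 95th percentile latency in milliseconds.
            report[f"{op}_p95_ms"] = samples[int(0.95 * (len(samples) - 1))] * 1000
    return report


if __name__ == "__main__":
    print(run_workload(InMemoryStore()))
```

Every number in the report is about individual key lookups and writes. A Hadoop benchmark would instead time whole jobs or bulk file reads and writes, which is why the two measurements are not interchangeable.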

    When Mongo is on top of Hadoop, will the data be shared among nodes by MongoDB or by Hadoop?

    I've never heard of this setup, but maybe the indices are stored by Mongo and the raw data is served by HDFS?