
GlusterFS or Ceph as backend for Hadoop


Has anyone tried using GlusterFS or Ceph as the backend for Hadoop? I don't mean just using a plugin to stitch things together. Is the performance better than HDFS itself? Is it OK for production use?

Also, is it really a good idea to merge object storage and Hadoop HDFS storage into a single storage system, or is it better to keep them separate?


Solution

  • I tried Ceph as a "drop-in" HDFS replacement in Hadoop 2.7. After solving many integration issues, I found it two to three times slower than HDFS with the default replication factor in the TeraSort benchmark. I don't know the reason for this. Others who took a different approach got similar results:

    http://www.snia.org/sites/default/files/SDC15_presentations/cloud_files/YuanZhou_big_data_analytics_on_object_store_r3.pdf

    Is it a good idea to combine object and HDFS storage? I don't think the question is framed correctly. Both HDFS (via Ozone and FUSE) and Ceph can be used as object storage and as regular POSIX filesystems. Ceph has an edge in that it offers block storage as well, while for HDFS this is still under discussion (https://issues.apache.org/jira/browse/HDFS-11118). If the question is "can I expose my storage as a POSIX FS, an object store, and a block store at the same time?", then the answer is that if the design satisfies your requirements for scalability and high availability, it could actually be a great idea.
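
For anyone who wants to reproduce the "drop-in" setup, here is a rough sketch of the kind of core-site.xml override involved, generated with a small Python helper. Everything in it is an assumption on my side rather than the exact configuration used above: the property names (fs.defaultFS, fs.ceph.impl, ceph.conf.file) and the class name are as I recall them from the CephFS Hadoop plugin documentation, the monitor address and ceph.conf path are placeholders, and build_core_site is a hypothetical helper. Check the names against the plugin version you actually install.

```python
import xml.etree.ElementTree as ET


def build_core_site(properties):
    """Return a Hadoop-style core-site.xml document as a string."""
    root = ET.Element("configuration")
    for name, value in properties.items():
        prop = ET.SubElement(root, "property")
        ET.SubElement(prop, "name").text = name
        ET.SubElement(prop, "value").text = value
    return ET.tostring(root, encoding="unicode")


# Assumed property names and placeholder values; verify them against
# the CephFS Hadoop plugin documentation for your version.
ceph_properties = {
    # Point the default filesystem at a Ceph monitor instead of a NameNode.
    "fs.defaultFS": "ceph://mon-host.example:6789/",
    # Class the plugin is expected to register for the ceph:// scheme.
    "fs.ceph.impl": "org.apache.hadoop.fs.ceph.CephFileSystem",
    # Location of the Ceph client configuration on each Hadoop node.
    "ceph.conf.file": "/etc/ceph/ceph.conf",
}

core_site_xml = build_core_site(ceph_properties)
print(core_site_xml)
```

The plugin jar and the libcephfs Java bindings also have to be on the Hadoop classpath. This is where many of the integration issues tend to come from, so treat the snippet as a starting point only.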