hadoop, hdfs, benchmarking, microsoft-distributed-file-system

Can I plug in a different DFS instead of HDFS with Hadoop?


I'm looking for a way to hook into Hadoop a new file system to benchmark the performance of this new file system against HDFS. I'm new to Hadoop so please feel free to correct me if I've asked the wrong question. If it helps, I'll be using Amazon's EMR.


Solution

  • You will need to create a Hadoop file system driver for your new file system: a class that extends org.apache.hadoop.fs.FileSystem. Examples of such 'drivers' are the well-known DistributedFileSystem (i.e. HDFS), LocalFileSystem, and S3FileSystem. You then have to register your new file system under a scheme in core-site.xml; let's say you register the scheme 'gaurav':

    <property>
      <name>fs.gaurav.impl</name>
      <value>com.package.GauravFileSystem</value>
    </property>
    

    You can now reference files in your own file system with the registered scheme: gaurav://somepath/somename. Optionally, you can make your new file system the default by changing fs.default.name (called fs.defaultFS in newer Hadoop versions). Your cluster should then run on top of your own file system (assuming everything is correct and works, of course).

    See HADOOP-9629 for an example of a complete Hadoop file system implementation.
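
    To make the driver idea concrete, here is a minimal sketch of what such a class looks like. The abstract methods shown are the ones org.apache.hadoop.fs.FileSystem requires you to implement; the class and package names (com.package.GauravFileSystem) are just the hypothetical ones from the config above, and the bodies are stubs you would replace with calls into your actual file system. It needs hadoop-common on the classpath to compile.

    ```java
    package com.package;

    import java.io.IOException;
    import java.net.URI;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataInputStream;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.fs.permission.FsPermission;
    import org.apache.hadoop.util.Progressable;

    /** Skeleton driver registered as fs.gaurav.impl in core-site.xml. */
    public class GauravFileSystem extends FileSystem {

        private URI uri;
        private Path workingDir;

        @Override
        public void initialize(URI name, Configuration conf) throws IOException {
            // Called once when Hadoop instantiates the driver for this scheme.
            super.initialize(name, conf);
            this.uri = name;
            this.workingDir = new Path("/");
        }

        @Override
        public URI getUri() {
            return uri; // e.g. gaurav://host
        }

        @Override
        public FSDataInputStream open(Path f, int bufferSize) throws IOException {
            // Wrap your file system's read stream in an FSDataInputStream.
            throw new UnsupportedOperationException("TODO: open " + f);
        }

        @Override
        public FSDataOutputStream create(Path f, FsPermission permission,
                boolean overwrite, int bufferSize, short replication,
                long blockSize, Progressable progress) throws IOException {
            throw new UnsupportedOperationException("TODO: create " + f);
        }

        @Override
        public FSDataOutputStream append(Path f, int bufferSize,
                Progressable progress) throws IOException {
            throw new UnsupportedOperationException("TODO: append " + f);
        }

        @Override
        public boolean rename(Path src, Path dst) throws IOException {
            throw new UnsupportedOperationException("TODO: rename");
        }

        @Override
        public boolean delete(Path f, boolean recursive) throws IOException {
            throw new UnsupportedOperationException("TODO: delete " + f);
        }

        @Override
        public FileStatus[] listStatus(Path f) throws IOException {
            throw new UnsupportedOperationException("TODO: listStatus " + f);
        }

        @Override
        public void setWorkingDirectory(Path newDir) {
            this.workingDir = newDir;
        }

        @Override
        public Path getWorkingDirectory() {
            return workingDir;
        }

        @Override
        public boolean mkdirs(Path f, FsPermission permission) throws IOException {
            throw new UnsupportedOperationException("TODO: mkdirs " + f);
        }

        @Override
        public FileStatus getFileStatus(Path f) throws IOException {
            throw new UnsupportedOperationException("TODO: getFileStatus " + f);
        }
    }
    ```

    Once this class is on the cluster's classpath and registered in core-site.xml, commands like `hadoop fs -ls gaurav://host/somepath` resolve through it, which is exactly the hook you need for benchmarking against HDFS.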