amazon-ec2 amazon-efs

Can AWS EFS be accessed from multiple Hadoop clusters?


I understand that EFS can be mounted on multiple EC2 instances.

Is it possible to connect to AWS EFS from multiple Hadoop clusters?

Or is it attached to a specific cluster?

Can we connect to EFS from outside the Hadoop clusters using an API?


Solution

  • You are using a Cloudera distribution for your Hadoop cluster, so you can configure whatever you wish.

    As a comparison, users of Amazon EMR (the AWS managed Hadoop service) normally choose from two types of storage:

    • Instance store: This is directly-attached disk storage, so it is very fast. Some instance types (e.g. m3, d2) offer large magnetic-disk storage, which is excellent for HDFS. Other instance types offer very fast SSD storage, but this is normally smaller in size. Please note that the contents of the instance store are lost when the EMR cluster is terminated.
    • EBS volumes: These are network-attached disks that offer much larger storage (up to 16TB per volume). Again, the contents are lost when the EMR cluster is terminated. EBS volumes and instance store can also be used together.

    For EMR (again, not your situation), users keep input and output data in Amazon S3 as a persistent data store. This way, data is not lost when the cluster is terminated. The benefit is that clusters can be turned off when they are not in use (saving money) and additional clusters can be spun up when more processing power is required. This is not possible in a traditional on-premises setup, where clusters are kept on permanently and cannot be scaled up or down.

    So, back to your Cloudera cluster... You will probably be using HDFS for your storage, in which case you would want attached disk storage. You also have the option of using S3 for storage of data, which can work out cheaper than disk storage.
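
As a rough illustration of the S3 option, the sketch below builds a `hadoop distcp` command that copies an HDFS directory to S3 through the `s3a://` connector. The bucket name, paths and the helper `build_distcp_command` are hypothetical, and it assumes S3 credentials are already configured on the cluster.

```python
import shlex


def build_distcp_command(hdfs_path, bucket, key_prefix):
    """Return the argument list for copying an HDFS directory to S3 with distcp.

    hdfs_path must be absolute (e.g. "/data/logs"). The bucket and key prefix
    are placeholders for illustration only.
    """
    if not hdfs_path.startswith("/"):
        raise ValueError("hdfs_path must be an absolute path")
    source = "hdfs://" + hdfs_path  # hdfs:///data/logs uses the default NameNode
    destination = f"s3a://{bucket}/{key_prefix.strip('/')}"
    return ["hadoop", "distcp", source, destination]


if __name__ == "__main__":
    # Hypothetical bucket and paths; run the printed command on a cluster node.
    command = build_distcp_command("/data/logs", "example-bucket", "backups/logs")
    print(shlex.join(command))
```

Because the data then lives in S3 rather than on cluster disks, any cluster (or any other client with access to the bucket) can read it.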

    Yes, you could attach Amazon EFS file systems via NFS, but EFS is normally used for sharing files between EC2 instances, and this is not the way that HDFS operates (it assumes locally-attached disks, with the distributed sharing happening at the HDFS level, i.e. the NameNode and DataNodes).
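
To make the mounting point concrete: an EFS file system is mounted over NFS by its DNS name, so it is not tied to any one cluster. The sketch below builds the mount command using the NFS options recommended in the Amazon EFS documentation. The file system ID, region, mount point and helper names are hypothetical, and it assumes the instance can reach an EFS mount target in its VPC with a security group that allows NFS traffic (port 2049).

```python
import shlex

# NFS mount options recommended in the Amazon EFS documentation.
EFS_NFS_OPTIONS = (
    "nfsvers=4.1,rsize=1048576,wsize=1048576,hard,timeo=600,retrans=2,noresvport"
)


def efs_dns_name(file_system_id, region):
    """Return the regional DNS name of an EFS file system."""
    return f"{file_system_id}.efs.{region}.amazonaws.com"


def build_mount_command(file_system_id, region, mount_point):
    """Return the argument list that mounts the file system root at mount_point."""
    remote = f"{efs_dns_name(file_system_id, region)}:/"
    return ["sudo", "mount", "-t", "nfs4", "-o", EFS_NFS_OPTIONS, remote, mount_point]


if __name__ == "__main__":
    # Hypothetical ID and region; the mount point directory must already exist.
    print(shlex.join(build_mount_command("fs-12345678", "us-east-1", "/mnt/efs")))
```

Running the same command on nodes of two different clusters would mount the same file system on both, provided each node has network access to a mount target.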

    I would recommend investigating whether you could use Amazon EMR instead of deploying your own Hadoop cluster due to the benefits of scaling, transient clusters, automatic deployment and regular upgrades. If you must use Cloudera, you will be responsible for managing and maintaining the cluster yourself.