Tags: apache-spark, kubernetes, amazon-s3, hdfs, amazon-emr

Using AWS EMRFS in Apache Spark hosted on EC2


If I am running Spark on EC2 (or in Kubernetes), can I use S3/EMRFS in place of HDFS? Is this production ready, and does it use parallelism to read and process data from S3?

Thanks in advance


Solution

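  • No, EMRFS is available only on EMR, where it is the easy way to make S3 look like part of HDFS. On EC2 (or Kubernetes) you can still connect Spark to S3, but it takes more setup than on EMR. S3 is not tightly coupled to EC2. Yes, reads are parallelized, but without the data locality that MapReduce gets on HDFS, where the worker and the data node are the same machine. With S3, the data always travels over the network to the workers.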
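
As a rough illustration of "connecting to S3" from Spark outside EMR, here is a minimal PySpark sketch that uses the open-source Hadoop S3A connector (`s3a://` paths) rather than EMRFS. It is not from the original answer and rests on several assumptions: a `hadoop-aws` JAR matching your Hadoop version is on the classpath, credentials come from the EC2 instance profile, and the bucket path is a hypothetical placeholder. The helper names `s3a_conf`, `build_session` and `count_partitions` are made up for this example.

```python
# Sketch only: Spark on EC2/Kubernetes reading S3 through the Hadoop S3A connector.
# Assumes pyspark and a matching hadoop-aws JAR are installed; nothing here is EMRFS.

def s3a_conf(use_instance_profile=True):
    """Return Spark settings that route s3a:// paths through the S3A connector."""
    conf = {
        "spark.hadoop.fs.s3a.impl": "org.apache.hadoop.fs.s3a.S3AFileSystem",
    }
    if use_instance_profile:
        # Assumption: the EC2 instance has an IAM role with access to the bucket.
        conf["spark.hadoop.fs.s3a.aws.credentials.provider"] = (
            "org.apache.hadoop.fs.s3a.auth.IAMInstanceCredentialsProvider"
        )
    return conf


def build_session(app_name="s3a-example"):
    """Create a SparkSession with the S3A settings applied."""
    from pyspark.sql import SparkSession  # imported lazily so s3a_conf() works without Spark

    builder = SparkSession.builder.appName(app_name)
    for key, value in s3a_conf().items():
        builder = builder.config(key, value)
    return builder.getOrCreate()


def count_partitions(spark, path="s3a://my-bucket/some/prefix/"):
    """Read Parquet from S3 (hypothetical path) and report how many partitions Spark created."""
    df = spark.read.parquet(path)
    # Each partition is read by a separate task, so this number reflects read parallelism.
    return df.rdd.getNumPartitions()
```

Calling `count_partitions(build_session())` against a real bucket shows how many tasks Spark uses to read the data in parallel. Those tasks pull data over the network, which is the "no data locality" point made above.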