I am having trouble integrating EMR with S3 i.e to implement EMRFS
EMR Version: emr-5.4.0
When I run hdfs dfs -ls s3://pathto/bucket/
I get following error
ls: com.amazon.ws.emr.hadoop.fs.shaded.com.amazonaws.services.s3.model.AmazonS3Exception: Access Denied (Service: Amazon S3; Status Code: 403; Error Code: AccessDenied; Request ID: XXXX), S3 Extended Request ID: XXXXX**
Please guide what is that, what I am missing ?
I have done following steps
Created a EMR cluster manually with new EMR Security policy and following configuration classification
"fs.s3.enableServerSideEncryption": "true",
"fs.s3.serverSideEncryption.kms.keyId":"KEYID"
EMR, by default, will use instance profile credentials(EMR_EC2_DefaultRole) to access your S3 bucket. The error means this role does not have necessary permissions to access S3 bucket.
You will need to verify the IAM Role policy of that role to allow necessary S3 actions on both bucket and objects (Like s3:list*). Also check if you have any explicit Deny's etc. http://docs.aws.amazon.com/AmazonS3/latest/dev/using-with-s3-actions.html
The access could also be denied because of a Bucket policy on set on the S3 bucket you are trying to access. http://docs.aws.amazon.com/AmazonS3/latest/dev/example-bucket-policies.html https://aws.amazon.com/blogs/security/iam-policies-and-bucket-policies-and-acls-oh-my-controlling-access-to-s3-resources/
Your EMR cluster could be using an VPC endpoint for S3 to access S3 rather than Internet/NAT. In that case, you'll also need to verify VPC endpoint policies as well. https://docs.aws.amazon.com/vpc/latest/userguide/vpc-endpoints-s3.html#vpc-endpoints-policies-s3