Tags: javascript, amazon-web-services, aws-lambda, aws-sdk, amazon-emr

Timeout while triggering AWS EMR flow step from AWS Lambda


I am trying to run an AWS Lambda function written in JavaScript, but I can't make it work properly. I have no trouble with the JS configuration and triggering (I successfully ran a hello-world app), but I'm having problems with the aws-sdk library. To be honest, I don't know whether this is a network configuration problem or an IAM configuration problem, but I'm pretty sure it's not a scripting issue, because the same code runs without any problem locally on my computer. The main problem is that when the Lambda function calls the AWS EMR API, the call times out. It's as if Lambda is not able to communicate with EMR.

Here, you can see the emr client (console.log(emr_client)):

  emr: Service {
    config: 
     Config {
       credentials: 
        EnvironmentCredentials {
          expired: false,
          expireTime: null,
          accessKeyId: 'XXXXXXXXXXXXXXXX',
          sessionToken: 'xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx',
          envPrefix: 'AWS' },
       credentialProvider: CredentialProviderChain { providers: [Array] },
       region: 'us-west-2',
       logger: null,
       apiVersions: {},
       apiVersion: '2009-03-31',
       endpoint: 'elasticmapreduce.us-west-2.amazonaws.com',
       httpOptions: { timeout: 120000 },
       maxRetries: undefined,
       maxRedirects: 10,
       paramValidation: true,
       sslEnabled: true,
       s3ForcePathStyle: false,
       s3BucketEndpoint: false,
       s3DisableBodySigning: true,
       computeChecksums: true,
       convertResponseTypes: true,
       correctClockSkew: false,
       customUserAgent: null,
       dynamoDbCrc32: true,
       systemClockOffset: 0,
       signatureVersion: 'v4',
       signatureCache: true,
       retryDelayOptions: {},
       useAccelerateEndpoint: false,
       accesKeyId: 'XXXXXXXXXXXXXXXX' },
    isGlobalEndpoint: false,
    endpoint: 
     Endpoint {
       protocol: 'https:',
       host: 'elasticmapreduce.us-west-2.amazonaws.com',
       port: 443,
       hostname: 'elasticmapreduce.us-west-2.amazonaws.com',
       pathname: '/',
       path: '/',
       href: 'https://elasticmapreduce.us-west-2.amazonaws.com/' },
    _clientId: 1 
    }
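The dump above shows httpOptions: { timeout: 120000 }, so a request that cannot reach the endpoint may hang for a long time before the SDK reports anything. Below is a minimal sketch of client options that make a connectivity failure show up quickly in the logs, assuming the AWS SDK for JavaScript v2 options httpOptions.connectTimeout, httpOptions.timeout and maxRetries. The helper name failFastOptions and the timeout values are hypothetical and only for illustration.

```javascript
// Hypothetical helper: builds client options that fail fast when the
// endpoint is unreachable. Values are illustrative, not recommendations.
function failFastOptions(region) {
  return {
    apiVersion: '2009-03-31',
    region: region,
    httpOptions: {
      connectTimeout: 3000, // ms to wait for the TCP connection
      timeout: 10000        // ms of socket inactivity before aborting
    },
    maxRetries: 1           // avoid multiplying the wait with retries
  };
}

// Usage inside the handler (requires the aws-sdk package):
// const emr = new AWS.EMR(failFastOptions(process.env.AWS_REGION));
```

With options like these, an unreachable endpoint should produce an SDK error well before a typical function timeout, which makes the network cause easier to spot.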

Some AWS config information:

  1. I created a VPC where my EMR cluster resides, located in the us-west-2 region, and I'm triggering the Lambda function there (as I confirmed by logging process.env.AWS_REGION).

  2. I set up a subnet that had previously been created inside this same VPC. The EMR cluster is inside it and the Lambda function has access to it.

  3. I set up a security group in this same VPC with all inbound/outbound traffic allowed (all ports from and to 0.0.0.0/0) to rule out a configuration problem there.

  4. I set up an execution role with the following policies attached and linked it to my Lambda function:

AWSLambdaFullAccess

AmazonElasticMapReduceFullAccess

AWSLambdaExecute

AWSLambdaVPCAccessExecutionRole

AWSLambdaRole

AWSLambdaENIManagementAccess

Finally, my code:

const AWS = require('aws-sdk');

exports.handler = (event, context, callback) => {
  const emr = new AWS.EMR({
    apiVersion:'2009-03-31',
    region: process.env.AWS_REGION,
    accessKeyId: process.env.AWS_ACCESS_KEY_ID,
    secretAccessKey: process.env.AWS_SECRET_ACCESS_KEY
  });

  const flowSteps = {
    JobFlowId: process.env['JOB_FLOW_ID'],
    Steps: [{
      Name: "my_beautiful_step",
      ActionOnFailure: "CANCEL_AND_WAIT",
      HadoopJarStep: {
        Jar: "command-runner.jar",
        Args: [
          "spark-submit",
          "--master", "yarn",
          ...
          ...
          ...
        ]
      }
    }]
  };

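  emr.addJobFlowSteps(flowSteps, (err, data) => {
    if (err) {
      console.log('ERROR', err, err.stack);
      // Report the failure to Lambda so the invocation ends right away
      callback(err);
    } else {
      console.log('NO ERROR', data);
      // Signal success and return the API response
      callback(null, data);
    }
  });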

};

EDIT: I tried communicating with S3 (getting a bucket location) just to test whether the problem was only with EMR, but that call also times out.
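One way to separate a network problem from an IAM or SDK problem is to probe the endpoint with a plain TCP connection, which involves no credentials at all. The following is a rough sketch using only Node's built-in net module. The helper name canConnect and the 3-second timeout are hypothetical choices for illustration.

```javascript
const net = require('net');

// Hypothetical helper: resolves to true if a TCP connection to host:port
// opens within timeoutMs, false on timeout or connection error.
function canConnect(host, port, timeoutMs) {
  return new Promise((resolve) => {
    const socket = net.connect({ host: host, port: port });
    const finish = (ok) => {
      socket.destroy();
      resolve(ok);
    };
    socket.setTimeout(timeoutMs);
    socket.once('connect', () => finish(true));
    socket.once('timeout', () => finish(false));
    socket.once('error', () => finish(false));
  });
}

// Usage inside the handler:
// canConnect('elasticmapreduce.us-west-2.amazonaws.com', 443, 3000)
//   .then((ok) => console.log('endpoint reachable:', ok));
```

If the probe reports false from inside the function but true from a local machine, the issue is most likely network routing rather than permissions.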


Solution

  • Well, I solved my issue. Basically, a Lambda function inside a VPC can't call AWS API endpoints unless the VPC has internet access, because most AWS services are exposed through a public URL, e.g., https://elasticmapreduce.us-west-2.amazonaws.com. You can clearly see this when you log the EMR client object (the same applies to other client objects such as S3, as I verified):

    Service {
      config: 
       Config {
         ...
         ...
         region: 'us-west-2',
         logger: null,
         apiVersions: {},
         apiVersion: null,
         endpoint: 'elasticmapreduce.us-west-2.amazonaws.com',
         httpOptions: { timeout: 120000 },
         maxRetries: undefined,
       },
      endpoint: 
       Endpoint {
         protocol: 'https:',
         host: 'elasticmapreduce.us-west-2.amazonaws.com',
         port: 443,
         hostname: 'elasticmapreduce.us-west-2.amazonaws.com',
         pathname: '/',
         path: '/',
         href: 'https://elasticmapreduce.us-west-2.amazonaws.com/' 
        },
      ...
    }
    

    Anyway, for some services AWS provides endpoints that are local to the VPC (VPC Endpoints), so you can reach those services from inside the VPC without internet access. Otherwise, you have to set up a NAT gateway plus an internet gateway (~US$30/month) to reach other services such as EMR.
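
    As an illustration of the VPC endpoint option, here is a minimal sketch of the parameters for the EC2 createVpcEndpoint call that adds an S3 gateway endpoint to a VPC. The helper name s3GatewayEndpointParams and the vpc-EXAMPLE / rtb-EXAMPLE IDs are hypothetical placeholders, and the call itself would be run from a machine that already has API access, not from the isolated Lambda function.

```javascript
// Hypothetical helper: builds the parameters for EC2 createVpcEndpoint
// to add an S3 gateway endpoint. The IDs passed in are placeholders.
function s3GatewayEndpointParams(region, vpcId, routeTableIds) {
  return {
    VpcId: vpcId,
    ServiceName: 'com.amazonaws.' + region + '.s3',
    VpcEndpointType: 'Gateway',
    RouteTableIds: routeTableIds
  };
}

// Usage (requires the aws-sdk package and ec2:CreateVpcEndpoint permission):
// const ec2 = new AWS.EC2({ region: 'us-west-2' });
// ec2.createVpcEndpoint(
//   s3GatewayEndpointParams('us-west-2', 'vpc-EXAMPLE', ['rtb-EXAMPLE']),
//   (err, data) => console.log(err || data.VpcEndpoint.VpcEndpointId));
```

    A gateway endpoint adds a route for S3 to the listed route tables, so the S3 test from the question should stop timing out once the function's subnet uses one of them.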