amazon-s3 aws-lambda copy amazon-emr

Copy files from S3 to EMR local using Lambda


I need to move the files from S3 to EMR's local dir /home/hadoop programmatically using Lambda.

S3DistCp copies the files to HDFS. I then log in to the EMR cluster and run an HDFS copyToLocal command on the command line to get the files into /home/hadoop.

Is there a programmatic way, using boto3 in Lambda, to copy from S3 to EMR's local dir?


Solution

  • I wrote a test Lambda function to submit a job step to EMR that copies files from S3 to EMR's local dir. This worked.

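    import boto3

    emrclient = boto3.client('emr', region_name='us-west-2')

    def lambda_handler(event, context):
        # List every cluster that is starting up or able to run steps.
        EMRS = emrclient.list_clusters(ClusterStates=['STARTING', 'RUNNING', 'WAITING'])
        clusters = EMRS["Clusters"]
        print(clusters)
        for cluster in clusters:
            ID = cluster["Id"]
            # command-runner.jar runs the AWS CLI as a step on the master node,
            # so the files land on that node's local filesystem.
            response = emrclient.add_job_flow_steps(
                JobFlowId=ID,
                Steps=[
                    {
                        'Name': 'AWS S3 Copy',
                        'ActionOnFailure': 'CONTINUE',
                        'HadoopJarStep': {
                            'Jar': 'command-runner.jar',
                            'Args': ["aws", "s3", "cp", "s3://XXX/", "/home/hadoop/copy/", "--recursive"],
                        },
                    }
                ],
            )
            # Log the submitted step IDs so the step can be looked up later.
            print(response["StepIds"])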
    

    If there are better ways to do the copy, please do let me know.
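
    One possible refinement, offered only as a sketch: the function above submits the copy step to every active cluster in the region. If only one cluster should receive the files, the step can be built by a small helper and sent to a single cluster ID. The helper names (build_s3_copy_step, copy_to_cluster) and the event keys (cluster_id, source, destination) are hypothetical and not part of the original function; the boto3 call is the same add_job_flow_steps used above.

```python
def build_s3_copy_step(source, destination, name="AWS S3 Copy"):
    """Return an EMR step definition that runs `aws s3 cp --recursive`."""
    return {
        "Name": name,
        "ActionOnFailure": "CONTINUE",
        "HadoopJarStep": {
            "Jar": "command-runner.jar",
            "Args": ["aws", "s3", "cp", source, destination, "--recursive"],
        },
    }


def copy_to_cluster(emr_client, cluster_id, source, destination):
    """Submit the copy step to one cluster and return the new step's ID."""
    response = emr_client.add_job_flow_steps(
        JobFlowId=cluster_id,
        Steps=[build_s3_copy_step(source, destination)],
    )
    return response["StepIds"][0]


def lambda_handler(event, context):
    # Imported here so the helpers above can be exercised without AWS access.
    import boto3

    emr_client = boto3.client("emr", region_name="us-west-2")
    # The event keys below are assumptions made for this sketch.
    return copy_to_cluster(
        emr_client,
        event["cluster_id"],
        event["source"],
        event["destination"],
    )
```

    Passing the client into copy_to_cluster keeps the step logic separate from the AWS connection, so it can be checked locally with a stub client before deploying the Lambda.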