
How to make EMR load classes from a custom jar first


Hadoop 1.0.3 doesn't support bzip2 decompression, so I copied the relevant classes from Hadoop 2.2 into my project. The project's jar still runs on a Hadoop 1.0.3 cluster, and I found that Hadoop still executes the classes from 1.0.3, i.e. the new classes are never used. How can I configure Hadoop to load the classes from my own jar first? I know that on a plain cluster I could use something like: hadoop jar collect_log.jar com.TestCol -Dmapreduce.task.classpath.user.precedence=true
But I'm using EMR now, and I don't know how to set this precedence there. Thanks a lot!


Solution

  • EMR references its Hadoop jars from /home/hadoop/lib. You can try using a bootstrap script to copy your new jars to this location.

    Another option: after you launch the EMR cluster, connect to the master node over SSH with your key file and run ps -ef | grep java.

    The output shows the running Hadoop processes and the order of the jars on their classpath. You can then change your bootstrap script to rearrange the classpath into the order you need.

    Edited to add a sample bootstrap script, mybootstrap.sh:

    #!/bin/bash
    hadoop fs -copyToLocal s3n://bucket/bootstrap/abc.jar /home/hadoop/lib/
    

    Upload this script to an S3 bucket and reference it in your EMR launcher code like this:

            RunJobFlowRequest request = new RunJobFlowRequest(.....
            ScriptBootstrapActionConfig bootstrapScriptConfig = new ScriptBootstrapActionConfig();
            bootstrapScriptConfig.setPath(CONFIG_HADOOP_BOOTSTRAP_ACTION);
    
            BootstrapActionConfig bootstrapConfig = new BootstrapActionConfig();
            bootstrapConfig.setName("copy jar file");
            bootstrapConfig.setScriptBootstrapAction(bootstrapScriptConfig);
            request.withBootstrapActions(bootstrapConfig);
    

    Here CONFIG_HADOOP_BOOTSTRAP_ACTION is the S3 path to your bootstrap script.
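
A note on the ps -ef | grep java suggestion: the classpath in that output is one long colon-separated string, so it can help to print one entry per line, in search order. The following is a minimal sketch rather than part of the original answer. classpath_entries is a hypothetical helper name, and it assumes the classpath is passed with -classpath or -cp and contains no spaces.

```shell
#!/bin/bash
# Hypothetical helper: print the classpath entries of one Java command line,
# one per line, in the order they appear on the command line.
# Assumes the classpath follows -classpath or -cp and contains no spaces.
classpath_entries() (
  set -f   # keep wildcard entries such as /home/hadoop/lib/* literal
  prev=""
  for word in $1; do
    case "$prev" in
      -classpath|-cp)
        printf '%s\n' "$word" | tr ':' '\n'
        exit 0
        ;;
    esac
    prev="$word"
  done
  exit 1   # no classpath option found on this command line
)
```

On the master node this could be used as ps -eo args | grep '[j]ava' | while IFS= read -r line; do classpath_entries "$line"; echo; done. If your jar is listed after Hadoop's own jars, the older classes are probably the ones being picked up, since classes are generally resolved from the first matching classpath entry.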