virtualenv amazon-emr amazon-ami

How to run a Python step in a virtualenv for Hadoop on AMI 4.x


In AMI 3.x, the file /home/hadoop/conf/hadoop-user-env.sh existed, and the legacy code I'm looking at ran this command during bootstrapping:

echo ". /home/hadoop/resources/pips/bin/activate" >> /home/hadoop/conf/hadoop-user-env.sh

This activates the Python virtualenv.

In AMI 4.x this file is gone. How am I supposed to get a Python step in Hadoop to run in a virtualenv under AMI 4.x?


Solution

  • Going to give this a shot and hope it helps you.

    In Amazon EMR AMI versions 2.x and 3.x, there was a hadoop-user-env.sh script which was not part of standard Hadoop and was used along with the configure-daemons bootstrap action to configure the Hadoop environment. The script included the following actions:

    #!/bin/bash 
    export HADOOP_USER_CLASSPATH_FIRST=true; 
    echo "HADOOP_CLASSPATH=/path/to/my.jar" >> /home/hadoop/conf/hadoop-user-env.sh
    

    In Amazon EMR release 4.x, you can do the same now with the hadoop-env configurations:

    [ 
      { 
         "Classification":"hadoop-env",
         "Properties":{},
         "Configurations":[ 
            { 
               "Classification":"export",
               "Properties":{ 
                  "HADOOP_USER_CLASSPATH_FIRST":"true",
                  "HADOOP_CLASSPATH":"/path/to/my.jar"
               }
            }
         ]
      }
    ]
    

    There is more information about the differences and their replacements in Amazon's documentation.
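
Applied to the virtualenv from the question, the same hadoop-env mechanism might look like the sketch below. It does not source the activate script. Instead it sets the two variables that script mainly changes, VIRTUAL_ENV and PATH. The helper name build_hadoop_env_config is hypothetical. It is also an assumption (not confirmed in the answer above) that values in the export classification end up as shell export lines, so that $PATH expands, and that your Python step inherits that environment. The virtualenv itself still has to exist on every node, for example by creating it in a bootstrap action.

```python
import json


def build_hadoop_env_config(venv_path):
    """Build an EMR 4.x configuration list that approximates activating a virtualenv.

    Hypothetical helper for illustration; it is not part of any AWS library.
    """
    return [
        {
            "Classification": "hadoop-env",
            "Properties": {},
            "Configurations": [
                {
                    "Classification": "export",
                    "Properties": {
                        # The activate script sets VIRTUAL_ENV to the venv root.
                        "VIRTUAL_ENV": venv_path,
                        # Assumption: the value is written as a shell export,
                        # so $PATH is expanded when the env file is sourced.
                        "PATH": f"{venv_path}/bin:$PATH",
                    },
                }
            ],
        }
    ]


if __name__ == "__main__":
    # Path taken from the question's bootstrap command.
    config = build_hadoop_env_config("/home/hadoop/resources/pips")
    print(json.dumps(config, indent=2))
```

The printed JSON can be saved to a file and supplied as the cluster's configurations at creation time. Check on a test cluster that the step really runs the virtualenv's interpreter before relying on it.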