apache-pig · elastic-map-reduce

How do you use Python UDFs with Pig in Elastic MapReduce?


I really want to take advantage of Python UDFs in Pig on our AWS Elastic MapReduce cluster, but I can't quite get things to work properly. No matter what I try, my pig job fails with the following exception being logged:

ERROR 2998: Unhandled internal error. org/python/core/PyException

java.lang.NoClassDefFoundError: org/python/core/PyException
        at org.apache.pig.scripting.jython.JythonScriptEngine.registerFunctions(JythonScriptEngine.java:127)
        at org.apache.pig.PigServer.registerCode(PigServer.java:568)
        at org.apache.pig.tools.grunt.GruntParser.processRegister(GruntParser.java:421)
        at org.apache.pig.tools.pigscript.parser.PigScriptParser.parse(PigScriptParser.java:419)
        at org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:188)
        at org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:164)
        at org.apache.pig.tools.grunt.Grunt.exec(Grunt.java:81)
        at org.apache.pig.Main.run(Main.java:437)
        at org.apache.pig.Main.main(Main.java:111)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
        at java.lang.reflect.Method.invoke(Method.java:597)
        at org.apache.hadoop.util.RunJar.main(RunJar.java:156) Caused by: java.lang.ClassNotFoundException: org.python.core.PyException
        at java.net.URLClassLoader$1.run(URLClassLoader.java:202)
        at java.security.AccessController.doPrivileged(Native Method)
        at java.net.URLClassLoader.findClass(URLClassLoader.java:190)
        at java.lang.ClassLoader.loadClass(ClassLoader.java:306)
        at java.lang.ClassLoader.loadClass(ClassLoader.java:247)
        ... 14 more

What do you need to do to use Python UDFs for Pig in Elastic MapReduce?


Solution

  • After quite a few wrong turns, I found that, at least on the Elastic MapReduce implementation of Hadoop, Pig seems to ignore the CLASSPATH environment variable. Instead, I found that I could control the class path using the HADOOP_CLASSPATH variable.

    Once I made that realization, it was fairly easy to get things set up to use Python UDFs:

    • Install Jython
      • sudo apt-get install jython -y -qq
    • Set the HADOOP_CLASSPATH environment variable.
      • export HADOOP_CLASSPATH=/usr/share/java/jython.jar:/usr/share/maven-repo/org/antlr/antlr-runtime/3.2/antlr-runtime-3.2.jar
        • jython.jar ensures that Hadoop can find the PyException class
        • antlr-runtime-3.2.jar ensures that Hadoop can find the CharStream class
    • Create the cache directory for Jython (this is documented in the Jython FAQ)
      • sudo mkdir /usr/share/java/cachedir/
      • sudo chmod a+rw /usr/share/java/cachedir
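    Taken together, the steps above can be sketched as a single setup script (for example as an EMR bootstrap action). The jar paths are the Debian/Ubuntu defaults from the steps above and may differ on other images:

    ```shell
    #!/bin/bash
    # Setup sketch for Python (Jython) UDF support in Pig on EMR.
    # Assumes a Debian/Ubuntu-based image; jar locations may differ elsewhere.
    set -e

    # 1. Install Jython, which provides org.python.core.PyException.
    sudo apt-get install jython -y -qq

    # 2. Point Hadoop (not Pig's CLASSPATH) at the jars that Pig's Jython
    #    script engine needs when registering the .py file.
    export HADOOP_CLASSPATH=/usr/share/java/jython.jar:/usr/share/maven-repo/org/antlr/antlr-runtime/3.2/antlr-runtime-3.2.jar

    # 3. Create a world-writable cache directory for Jython's package scanner
    #    (per the Jython FAQ).
    sudo mkdir -p /usr/share/java/cachedir/
    sudo chmod a+rw /usr/share/java/cachedir/
    ```

    Note that the export only affects the shell running the script; to make it stick for later Pig sessions, you would add that line to your shell profile (or to hadoop-env.sh) instead.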

    I should point out that this seems to directly contradict other advice I found while searching for solutions to this problem:

    • Setting the CLASSPATH and PIG_CLASSPATH environment variables doesn't seem to do anything.
    • The .py file containing the UDF does not need to be included in the HADOOP_CLASSPATH environment variable.
    • The path to the .py file used in the Pig register statement may be relative or absolute; it doesn't seem to matter.
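
    To tie it together, the .py file being registered might look like the following (a hypothetical udfs.py; the function name and schema are illustrative). The outputSchema decorator is supplied by Pig's Jython engine at registration time, so the fallback below also lets plain Python import the file for local testing:

    ```python
    # udfs.py -- a hypothetical Python UDF for Pig (names are illustrative).

    # outputSchema is injected by Pig's Jython script engine; define a no-op
    # fallback so this module can also be imported by plain Python for tests.
    try:
        outputSchema
    except NameError:
        def outputSchema(schema):
            def decorator(func):
                return func
            return decorator

    @outputSchema('word:chararray')
    def normalize(word):
        """Lower-case and strip a single token; pass nulls through."""
        if word is None:
            return None
        return word.strip().lower()
    ```

    In the Pig script it would then be registered and called with something like `register 'udfs.py' using jython as myfuncs;` followed by `... generate myfuncs.normalize(word);`.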