I really want to take advantage of Python UDFs in Pig on our AWS Elastic MapReduce cluster, but I can't quite get things to work properly. No matter what I try, my pig job fails with the following exception being logged:
ERROR 2998: Unhandled internal error. org/python/core/PyException
java.lang.NoClassDefFoundError: org/python/core/PyException
at org.apache.pig.scripting.jython.JythonScriptEngine.registerFunctions(JythonScriptEngine.java:127)
at org.apache.pig.PigServer.registerCode(PigServer.java:568)
at org.apache.pig.tools.grunt.GruntParser.processRegister(GruntParser.java:421)
at org.apache.pig.tools.pigscript.parser.PigScriptParser.parse(PigScriptParser.java:419)
at org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:188)
at org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:164)
at org.apache.pig.tools.grunt.Grunt.exec(Grunt.java:81)
at org.apache.pig.Main.run(Main.java:437)
at org.apache.pig.Main.main(Main.java:111)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
at java.lang.reflect.Method.invoke(Method.java:597)
at org.apache.hadoop.util.RunJar.main(RunJar.java:156) Caused by: java.lang.ClassNotFoundException: org.python.core.PyException
at java.net.URLClassLoader$1.run(URLClassLoader.java:202)
at java.security.AccessController.doPrivileged(Native Method)
at java.net.URLClassLoader.findClass(URLClassLoader.java:190)
at java.lang.ClassLoader.loadClass(ClassLoader.java:306)
at java.lang.ClassLoader.loadClass(ClassLoader.java:247)
... 14 more
What do you need to do to use Python UDFs for Pig in Elastic MapReduce?
After quite a few wrong turns, I found that, at least on the elastic map reduce implementation of Hadoop, Pig seems to ignore the CLASSPATH environment variable. I found instead that I could control the class path using the HADOOP_CLASSPATH variable instead.
Once I made that realization, it was fairly easy to get things setup to use Python UDFS:
sudo apt-get install jython -y -qq
export HADOOP_CLASSPATH=/usr/share/java/jython.jar:/usr/share/maven-repo/org/antlr/antlr-runtime/3.2/antlr-runtime-3.2.jar
sudo mkdir /usr/share/java/cachedir/
sudo chmod a+rw /usr/share/java/cachedir
I should point out that this seems to directly contradict other advice I found while searching for solutions to this problem:
register
statement may be relative or absolute, it doesn't seem to matter.