I'm running an Amazon Elastic MapReduce (EMR) job using Pig. I'm having trouble importing the json or simplejson modules into my Python user-defined function (UDF).
Here is my code:
#!/usr/bin/env python
import simplejson as json
@outputSchema('m:map[]')
def flattenJSON(text):
    j = json.loads(text)
    ...
When I try to register the function in Pig, I get an error saying "No module named simplejson":
grunt> register 's3://chopperui-emr/code/flattenDict.py' using jython as flatten;
2015-05-31 16:53:43,041 [main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR 1121: Python Error. Traceback (most recent call last):
File "/tmp/pig6071834754384533869tmp/flattenDict.py", line 32, in <module>
import simplejson as json
ImportError: No module named simplejson
However, my Amazon AMI includes Python 2.6, which has json in its standard library (using import json in the UDF doesn't work either). Also, if I try to install simplejson using pip, it says it's already installed (on both master and core nodes).
[hadoop@ip-172-31-46-71 ~]$ pip install simplejson
Requirement already satisfied (use --upgrade to upgrade): simplejson in /usr/local/lib64/python2.6/site-packages
Also, the import works fine if I run python interactively from the command line on the master node:
[hadoop@ip-172-31-46-71 ~]$ python
Python 2.6.9 (unknown, Apr 1 2015, 18:16:00)
[GCC 4.8.2 20140120 (Red Hat 4.8.2-16)] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import json
>>>
There must be something different about how EMR or Pig is setting up the Python environment, but what?
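A Pig UDF registered with "using jython" runs under Jython, a Java implementation of Python, not under the system CPython interpreter. Jython has its own module search path, so the simplejson that pip installed for CPython (in /usr/local/lib64/python2.6/site-packages) is not visible to it. If the Jython bundled with Pig is a 2.5 release, the standard json module is missing as well, because json only entered the standard library in Python 2.6.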
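One way to confirm which interpreter is actually executing the UDF file is to report a few facts from the sys module. The following is only a minimal sketch: describe_interpreter is a hypothetical helper name, and where the printed output ends up depends on how Pig is being run.

```python
import sys

def describe_interpreter():
    """Return basic facts about the interpreter running this code."""
    return {
        'version': sys.version,
        'platform': sys.platform,  # starts with 'java' under Jython
        'path': list(sys.path),    # directories searched on import
    }

info = describe_interpreter()
print(info['version'])
print(info['platform'])
for entry in info['path']:
    print(entry)
```

If the CPython site-packages directory is absent from the printed path, that would explain why the import fails inside Pig but succeeds in the interactive python session.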
You can try Jyson, a JSON codec written for Jython, as the JSON parser instead.
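A hedged sketch of how the import could fall back to Jyson is shown here. It assumes the Jyson jar is already on the Jython classpath and that its codec is importable as com.xhaus.jyson.JysonCodec; parse_json is a hypothetical function name, and the no-op outputSchema stand-in is only there so the file can also be run outside Pig.

```python
# Try JSON codecs in order; which one imports depends on the interpreter.
try:
    import json  # standard library from Python 2.6 onward
except ImportError:
    try:
        import simplejson as json  # only if it is on this interpreter's path
    except ImportError:
        # Assumption: the Jyson jar is on the Jython classpath.
        from com.xhaus.jyson import JysonCodec as json

# Pig normally supplies outputSchema when it registers a Jython UDF.
# This stand-in is an assumption for running the file outside Pig.
try:
    outputSchema
except NameError:
    def outputSchema(schema):
        def decorator(func):
            return func
        return decorator

@outputSchema('m:map[]')
def parse_json(text):
    # Hypothetical UDF: decode a JSON object string into a map.
    return json.loads(text)
```

Because the fallback chain ends with Jyson, the same file can be exercised under CPython on the master node before it is registered in Pig.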