python eclipse mapreduce mrjob

How to debug Python MapReduce programs written in mrjob from Eclipse


I am trying to debug MapReduce jobs written with Python's mrjob library, using Eclipse under Ubuntu. Does anyone have an idea how this could be done?


Solution

  • Debugging mrjob jobs can be quite a challenge. I started by wrapping the body of each mapper and reducer in a try/except clause and yielding the exception (formatted with the traceback module) into the results, instead of letting it break the job flow. That first approach was time-consuming, because you often have to wait several minutes for the job to finish, and in the end most errors turned out to be undefined variables or syntax errors. So next I fed the jobs small test logs, which significantly reduced the time spent running a job just to see what the problem was. Another approach is to test the mappers and reducers outside of Hadoop. This is very convenient because you can use pdb and figure out problems quickly.

    Finally, you can also follow the suggestion to check mrjob's documentation, where you will find how to run the job locally, which comes in very handy: http://packages.python.org/mrjob/runners-inline.html
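
To make the ideas above concrete, here is a minimal sketch of the try/except pattern combined with testing outside of Hadoop. It deliberately uses plain generator functions and only the standard library, so it is not mrjob's API: in a real job the same bodies would live in the `mapper` and `reducer` methods of an `MRJob` subclass. The names `safe_mapper`, `reducer`, `run_locally` and the `__error__` key are hypothetical, chosen only for illustration.

```python
import traceback
from collections import defaultdict


def safe_mapper(_, line):
    """Word-count mapper that reports failures as output instead of raising."""
    try:
        for word in line.split():
            yield word.lower(), 1
    except Exception:
        # Emit the traceback under a reserved (hypothetical) key so the
        # rest of the input is still processed.
        yield "__error__", traceback.format_exc()


def reducer(key, values):
    """Sum the counts; pass error tracebacks through untouched."""
    values = list(values)
    if key == "__error__":
        for tb in values:
            yield key, tb
    else:
        yield key, sum(values)


def run_locally(lines):
    """Emulate map -> group by key -> reduce in a single process."""
    grouped = defaultdict(list)
    for line in lines:
        for key, value in safe_mapper(None, line):
            grouped[key].append(value)
    results = []
    for key in sorted(grouped):
        results.extend(reducer(key, grouped[key]))
    return results


if __name__ == "__main__":
    # A tiny test input; None simulates a malformed record.
    sample = ["Hello world", "hello mrjob", None]
    # To step through in a debugger, uncomment the next line
    # (or set a breakpoint here from the IDE):
    # import pdb; pdb.set_trace()
    for key, value in run_locally(sample):
        print(key, value)
```

Because everything runs in one ordinary Python process, a breakpoint set in Eclipse (for example through PyDev) or a `pdb.set_trace()` call should be hit as in any other script. As far as I know, running an actual mrjob script with `-r inline` gives the same single-process behaviour, but check the documentation linked above for your version.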