Tags: python, debugging, pyspark, ipdb

Debugging PySpark in ipdb fashion


When developing Python code, I make use of the ipdb package.

This halts execution of the Python code at the point where I have inserted ipdb.set_trace() and drops me into an interactive Python prompt.
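
For reference, a minimal sketch of that workflow in a plain Python script (the file name and the add function are only illustrative):

    import ipdb

    def add(a, b):
        result = a + b
        # Execution stops here and an ipdb> prompt opens,
        # where `a`, `b` and `result` can be inspected.
        ipdb.set_trace()
        return result

    print(add(2, 3))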

However, in the Python code that I develop for PySpark and send off using spark-submit, the ipdb package does not work.

So my question is: is there a way in which I can debug my PySpark code in a manner similar to using the ipdb package?

Note: Obviously, for Python code executed on remote nodes this would not be possible. But when using spark-submit with the option --master local[1] (an example invocation is sketched below), I have hopes that it might be possible.
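
To make the setup concrete, here is a sketch of the kind of script and invocation I mean; the file name main.py, the example DataFrame, and the breakpoint location are only illustrative:

    # Submitted with: spark-submit --master local[1] main.py
    import ipdb
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.master("local[1]").getOrCreate()
    df = spark.createDataFrame([(1, "a"), (2, "b")], ["id", "label"])

    # The hope is to stop here and inspect `df` interactively,
    # but under spark-submit the ipdb prompt is not usable.
    ipdb.set_trace()

    print(df.count())
    spark.stop()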

P.S. There is a related question, but with a narrower scope, here: How to Debug PySpark Codes in Jupyter Notebook


Solution

  • PYSPARK_DRIVER_PYTHON=ipython pyspark
    
          ____              __
         / __/__  ___ _____/ /__
        _\ \/ _ \/ _ `/ __/  '_/
       /__ / .__/\_,_/_/ /_/\_\   version 2.4.3
          /_/
    
    Using Python version 3.7.1 (default, Jun 16 2019 23:56:28)
    SparkSession available as 'spark'.
    
    In [1]: sc.stop()
    
    In [2]: run -d main.py
    Breakpoint 1 at /Users/andrii/work/demo/main.py:1
    NOTE: Enter 'c' at the ipdb>  prompt to continue execution.
    > /Users/andrii/work/demo/main.py(1)<module>()
    1---> 1 print(123)
          2 import ipdb;ipdb.set_trace()
          3 a = 2
          4 b = 3
    

    or, without the -d flag:

    In [3]: run main.py
    123
    > /Users/andrii/work/demo/main.py(3)<module>()
          2 import ipdb;ipdb.set_trace()
    ----> 3 a = 2
          4 b = 3
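
A short note on what happens above: sc.stop() shuts down the SparkContext that the pyspark shell creates on startup, so that main.py can build its own; run -d runs the script under the debugger with a breakpoint at line 1 (hence the Breakpoint 1 message), while a plain run only stops at explicit ipdb.set_trace() calls. The main.py in the transcript is a toy script; below is a sketch, with purely illustrative names, of a slightly more realistic one that fits the same IPython-driver workflow:

    from pyspark.sql import SparkSession

    # Create a fresh session; this assumes the shell's default
    # SparkContext has already been stopped (the `sc.stop()` above).
    spark = SparkSession.builder.master("local[1]").getOrCreate()

    df = spark.range(10).withColumnRenamed("id", "n")

    import ipdb; ipdb.set_trace()  # drops into ipdb inside the IPython driver

    print(df.count())
    spark.stop()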