Tags: python, apache-spark, pyspark, pipenv

Disable PySpark's info output when running a script


I have started using PySpark. The PySpark version is 3.5.4, installed via pip.

This is my code:

from pyspark.sql import SparkSession
pyspark = SparkSession.builder.master("local[8]").appName("test").getOrCreate()
df = pyspark.read.csv("test.csv", header=True)

print(df.show())

Every time I run the program using:

python test_01.py

it prints all this info about PySpark (in yellow):

[screenshot: console output filled with PySpark startup and info messages]

How can I disable this output so that it is not printed?


Solution

    1. Different lines are coming from different sources:
      • Windows ("SUCCESS: ..."),
      • the Spark launcher shell/batch scripts (":: loading settings ::..."),
      • core Spark code logging via log4j2,
      • core Spark code printing via System.out.println().
    2. Different lines are written to different fds (stdout, stderr, the log4j log file).
    3. Spark offers different "scripts" (pyspark, spark-submit, spark-shell, ...) for different purposes. You're probably using the wrong one here.

    It's very tedious to pick and choose which lines to disable from specific sources going to specific fds. The easiest option is to control the core logs using log4j2, which can be done as described in wiltonsr's answer, or in a little more detail here.
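    For the log4j2 route, here is a minimal sketch of doing it from inside the script. Note this only silences messages emitted after the SparkContext exists; the launcher/Ivy lines shown further below would need a log4j2.properties under SPARK_HOME/conf (or output redirection) instead:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.master("local[8]").appName("test").getOrCreate()

    # Raise the log4j threshold so Spark's own INFO/WARN chatter is dropped.
    # Accepted levels include ALL, DEBUG, INFO, WARN, ERROR, FATAL, OFF.
    spark.sparkContext.setLogLevel("ERROR")

    df = spark.read.csv("test.csv", header=True)
    df.show()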


    Based on what you're looking to do, the simplest option is to use spark-submit, which is meant for headless execution:

    CMD> cat test.py
    from pyspark.sql import SparkSession
    # the extra package is configured only to produce some log output for this demo
    spark = SparkSession.builder \
        .config('spark.jars.packages', 'io.delta:delta-core_2.12:2.4.0') \
        .getOrCreate()
    
    spark.createDataFrame(data=[(i,) for i in range(5)], schema='id: int').show()
    
    CMD> spark-submit test.py
    +---+
    | id|
    +---+
    |  0|
    |  1|
    |  2|
    |  3|
    |  4|
    +---+
    
    
    CMD>
    

    Understanding who writes what to which fd is a tedious process, and it might even change with the platform (Linux/Windows/Mac). I would not recommend going down that path, but if you really want to, here are a few hints:

    1. From your original code:

    print(df.show())

    • df.show() prints the DataFrame to stdout and returns None.
    • print(df.show()) therefore additionally prints None to stdout.
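    If the goal is just to print the table, calling show() on its own is enough:

    df.show()  # prints the DataFrame and returns None; no print() wrapper needed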
    2. Running using python instead of spark-submit:
    CMD> python test.py
    :: loading settings :: url = jar:file:/C:/My/.venv/Lib/site-packages/pyspark/jars/ivy-2.5.1.jar!/org/apache/ivy/core/settings/ivysettings.xml
    Ivy Default Cache set to: C:\Users\e679994\.ivy2\cache
    The jars for the packages stored in: C:\Users\e679994\.ivy2\jars
    io.delta#delta-core_2.12 added as a dependency
    :: resolving dependencies :: org.apache.spark#spark-submit-parent-499a6ac1-b961-44da-af58-de97e4357cbf;1.0
            confs: [default]
            found io.delta#delta-core_2.12;2.4.0 in central
            found io.delta#delta-storage;2.4.0 in central
            found org.antlr#antlr4-runtime;4.9.3 in central
    :: resolution report :: resolve 171ms :: artifacts dl 8ms
            :: modules in use:
            io.delta#delta-core_2.12;2.4.0 from central in [default]
            io.delta#delta-storage;2.4.0 from central in [default]
            org.antlr#antlr4-runtime;4.9.3 from central in [default]
            ---------------------------------------------------------------------
            |                  |            modules            ||   artifacts   |
            |       conf       | number| search|dwnlded|evicted|| number|dwnlded|
            ---------------------------------------------------------------------
            |      default     |   3   |   0   |   0   |   0   ||   3   |   0   |
            ---------------------------------------------------------------------
    :: retrieving :: org.apache.spark#spark-submit-parent-499a6ac1-b961-44da-af58-de97e4357cbf
            confs: [default]
            0 artifacts copied, 3 already retrieved (0kB/7ms)
    +---+
    | id|
    +---+
    |  0|
    |  1|
    |  2|
    |  3|
    |  4|
    +---+
    
    
    CMD> SUCCESS: The process with PID 38136 (child process of PID 38196) has been terminated.
    SUCCESS: The process with PID 38196 (child process of PID 35316) has been terminated.
    SUCCESS: The process with PID 35316 (child process of PID 22336) has been terminated.
    
    CMD>
    
    3. Redirecting stdout (fd=1) to a file:
    CMD> python test.py > out.txt 2> err.txt
    
    CMD> 
    CMD> cat out.txt
    :: loading settings :: url = jar:file:/C:/My/.venv/Lib/site-packages/pyspark/jars/ivy-2.5.1.jar!/org/apache/ivy/core/settings/ivysettings.xml
    +---+
    | id|
    +---+
    |  0|
    |  1|
    |  2|
    |  3|
    |  4|
    +---+
    
    SUCCESS: The process with PID 25080 (child process of PID 38032) has been terminated.
    SUCCESS: The process with PID 38032 (child process of PID 21176) has been terminated.
    SUCCESS: The process with PID 21176 (child process of PID 38148) has been terminated.
    SUCCESS: The process with PID 38148 (child process of PID 32456) has been terminated.
    SUCCESS: The process with PID 32456 (child process of PID 31656) has been terminated.
    
    CMD> 
    
    4. Redirecting stderr (fd=2) to a file:
    CMD> cat err.txt
    Ivy Default Cache set to: C:\Users\kash\.ivy2\cache
    The jars for the packages stored in: C:\Users\kash\.ivy2\jars
    io.delta#delta-core_2.12 added as a dependency
    :: resolving dependencies :: org.apache.spark#spark-submit-parent-597f3c82-718d-498b-b00e-7928264c307a;1.0
            confs: [default]
            found io.delta#delta-core_2.12;2.4.0 in central
            found io.delta#delta-storage;2.4.0 in central
            found org.antlr#antlr4-runtime;4.9.3 in central
    :: resolution report :: resolve 111ms :: artifacts dl 5ms
            :: modules in use:
            io.delta#delta-core_2.12;2.4.0 from central in [default]
            io.delta#delta-storage;2.4.0 from central in [default]
            org.antlr#antlr4-runtime;4.9.3 from central in [default]
            ---------------------------------------------------------------------
            |                  |            modules            ||   artifacts   |
            |       conf       | number| search|dwnlded|evicted|| number|dwnlded|
            ---------------------------------------------------------------------
            |      default     |   3   |   0   |   0   |   0   ||   3   |   0   |
            ---------------------------------------------------------------------
    :: retrieving :: org.apache.spark#spark-submit-parent-597f3c82-718d-498b-b00e-7928264c307a
            confs: [default]
            0 artifacts copied, 3 already retrieved (0kB/5ms)
    
    CMD> 
    
    5. SUCCESS: The process with PID ...
      • Note how this is printed AFTER the CMD> prompt, i.e. it's printed by "Windows" after it has finished executing python.
      • You won't see it on Linux. E.g., from my Linux box:
    kash@ub$ python test.py
    19:15:50.037 [main] WARN  org.apache.spark.util.Utils - Your hostname, ub resolves to a loopback address: 127.0.1.1; using 192.168.177.129 instead (on interface ens33)
    19:15:50.049 [main] WARN  org.apache.spark.util.Utils - Set SPARK_LOCAL_IP if you need to bind to another address
    :: loading settings :: url = jar:file:/home/kash/workspaces/spark-log-test/.venv/lib/python3.9/site-packages/pyspark/jars/ivy-2.5.0.jar!/org/apache/ivy/core/settings/ivysettings.xml
    Ivy Default Cache set to: /home/kash/.ivy2/cache
    The jars for the packages stored in: /home/kash/.ivy2/jars
    io.delta#delta-core_2.12 added as a dependency
    :: resolving dependencies :: org.apache.spark#spark-submit-parent-7d38e7a2-a0e5-47fa-bfda-2cb5b8b443e0;1.0
        confs: [default]
        found io.delta#delta-core_2.12;2.4.0 in spark-list
        found io.delta#delta-storage;2.4.0 in spark-list
        found org.antlr#antlr4-runtime;4.9.3 in spark-list
    :: resolution report :: resolve 390ms :: artifacts dl 10ms
        :: modules in use:
        io.delta#delta-core_2.12;2.4.0 from spark-list in [default]
        io.delta#delta-storage;2.4.0 from spark-list in [default]
        org.antlr#antlr4-runtime;4.9.3 from spark-list in [default]
        ---------------------------------------------------------------------
        |                  |            modules            ||   artifacts   |
        |       conf       | number| search|dwnlded|evicted|| number|dwnlded|
        ---------------------------------------------------------------------
        |      default     |   3   |   0   |   0   |   0   ||   3   |   0   |
        ---------------------------------------------------------------------
    :: retrieving :: org.apache.spark#spark-submit-parent-7d38e7a2-a0e5-47fa-bfda-2cb5b8b443e0
        confs: [default]
        0 artifacts copied, 3 already retrieved (0kB/19ms)
    Setting default log level to "WARN".
    To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
    +---+                                                                           
    | id|
    +---+
    |  0|
    |  1|
    |  2|
    |  3|
    |  4|
    +---+
    
    kash@ub$
    

    Is there any way I can disable that on Windows 10/11? This part: SUCCESS: The process with PID 5552 (child process of PID 4668) has been terminated. ... when you run it with python. –IGRACH

    Would not recommend.

    It seems to be coming from java_gateway.py. You can add stdout=PIPE to the Popen call in your local installation, and the output of taskkill will be suppressed.

    if on_windows:
        # In Windows, the child process here is "spark-submit.cmd", not the JVM itself
        # (because the UNIX "exec" command is not available). This means we cannot simply
        # call proc.kill(), which kills only the "spark-submit.cmd" process but not the
        # JVMs. Instead, we use "taskkill" with the tree-kill option "/t" to terminate all
        # child processes in the tree (http://technet.microsoft.com/en-us/library/bb491009.aspx)
        def killChild():
            Popen(["cmd", "/c", "taskkill", "/f", "/t", "/pid", str(proc.pid)])
    

    Change the last line to:

    Popen(["cmd", "/c", "taskkill", "/f", "/t", "/pid", str(proc.pid)], stdout=PIPE)
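    For reference, here is a sketch of what the patched block in java_gateway.py would then look like. This assumes PIPE is available from subprocess in that module (add the import if your copy doesn't already have it), and keep in mind that editing a file inside site-packages is a local workaround that will be lost when pyspark is reinstalled or upgraded:

    from subprocess import Popen, PIPE  # PIPE may already be imported in java_gateway.py

    if on_windows:
        # Use "taskkill" with the tree-kill option "/t" to terminate the whole
        # process tree, but attach a pipe to stdout so the "SUCCESS: ..." lines
        # from taskkill are swallowed instead of reaching the console.
        def killChild():
            Popen(["cmd", "/c", "taskkill", "/f", "/t", "/pid", str(proc.pid)],
                  stdout=PIPE)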