I have started to use PySpark. The PySpark version is 3.5.4 and it is installed via pip.
This is my code:
from pyspark.sql import SparkSession
pyspark = SparkSession.builder.master("local[8]").appName("test").getOrCreate()
df = pyspark.read.csv("test.csv", header=True)
print(df.show())
Every time I run the program with:
python test_01.py
it prints all this info about PySpark (in yellow):
How can I disable this so it is not printed?
That output comes from several sources, and much of it is written on the JVM side (log4j and plain System.out.println()-style writes) rather than by Python. Spark also ships with several launchers (pyspark, spark-submit, spark-shell, ...) for different purposes, and you're probably using the wrong one here. It's very tedious to pick and choose which lines, from which sources, going to which fds, to disable. Easiest of course is to control the core logs using log4j2, which can be done as described in wiltonsr's answer, or in a little more detail here.
Based on what you're looking to do, the simplest option is to use spark-submit, which is meant for headless execution:
CMD> cat test.py
from pyspark.sql import SparkSession

# the extra package is configured just to produce dependency-resolution logs
spark = SparkSession.builder \
    .config('spark.jars.packages', 'io.delta:delta-core_2.12:2.4.0') \
    .getOrCreate()
spark.createDataFrame(data=[(i,) for i in range(5)], schema='id: int').show()
CMD> spark-submit test.py
+---+
| id|
+---+
| 0|
| 1|
| 2|
| 3|
| 4|
+---+
CMD>
Understanding who writes what to which fd is a tedious process, and it might even change with the platform (Linux/Windows/Mac). I would not recommend it. But if you really want to, here are a few hints:
print(df.show()): df.show() prints the DataFrame to stdout and returns None, so print(df.show()) then prints None to stdout as well.
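To see that concretely, a small self-contained sketch (the data and app name here are made up for illustration):

from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[1]").appName("show-returns-none").getOrCreate()
df = spark.createDataFrame([(1,), (2,)], schema="id: int")

result = df.show()   # show() prints the table to stdout as a side effect...
print(result)        # ...and returns None, so this line prints "None"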
Running python instead of spark-submit:
CMD> python test.py
:: loading settings :: url = jar:file:/C:/My/.venv/Lib/site-packages/pyspark/jars/ivy-2.5.1.jar!/org/apache/ivy/core/settings/ivysettings.xml
Ivy Default Cache set to: C:\Users\e679994\.ivy2\cache
The jars for the packages stored in: C:\Users\e679994\.ivy2\jars
io.delta#delta-core_2.12 added as a dependency
:: resolving dependencies :: org.apache.spark#spark-submit-parent-499a6ac1-b961-44da-af58-de97e4357cbf;1.0
confs: [default]
found io.delta#delta-core_2.12;2.4.0 in central
found io.delta#delta-storage;2.4.0 in central
found org.antlr#antlr4-runtime;4.9.3 in central
:: resolution report :: resolve 171ms :: artifacts dl 8ms
:: modules in use:
io.delta#delta-core_2.12;2.4.0 from central in [default]
io.delta#delta-storage;2.4.0 from central in [default]
org.antlr#antlr4-runtime;4.9.3 from central in [default]
---------------------------------------------------------------------
| | modules || artifacts |
| conf | number| search|dwnlded|evicted|| number|dwnlded|
---------------------------------------------------------------------
| default | 3 | 0 | 0 | 0 || 3 | 0 |
---------------------------------------------------------------------
:: retrieving :: org.apache.spark#spark-submit-parent-499a6ac1-b961-44da-af58-de97e4357cbf
confs: [default]
0 artifacts copied, 3 already retrieved (0kB/7ms)
+---+
| id|
+---+
| 0|
| 1|
| 2|
| 3|
| 4|
+---+
CMD> SUCCESS: The process with PID 38136 (child process of PID 38196) has been terminated.
SUCCESS: The process with PID 38196 (child process of PID 35316) has been terminated.
SUCCESS: The process with PID 35316 (child process of PID 22336) has been terminated.
CMD>
Redirecting stdout (fd=1) to a file:
CMD> python test.py > out.txt 2> err.txt
CMD>
CMD> cat out.txt
CMD>
CMD> cat out.txt
:: loading settings :: url = jar:file:/C:/My/.venv/Lib/site-packages/pyspark/jars/ivy-2.5.1.jar!/org/apache/ivy/core/settings/ivysettings.xml
+---+
| id|
+---+
| 0|
| 1|
| 2|
| 3|
| 4|
+---+
SUCCESS: The process with PID 25080 (child process of PID 38032) has been terminated.
SUCCESS: The process with PID 38032 (child process of PID 21176) has been terminated.
SUCCESS: The process with PID 21176 (child process of PID 38148) has been terminated.
SUCCESS: The process with PID 38148 (child process of PID 32456) has been terminated.
SUCCESS: The process with PID 32456 (child process of PID 31656) has been terminated.
CMD>
And stderr (fd=2), which went to the other file:
CMD> cat err.txt
Ivy Default Cache set to: C:\Users\kash\.ivy2\cache
The jars for the packages stored in: C:\Users\kash\.ivy2\jars
io.delta#delta-core_2.12 added as a dependency
:: resolving dependencies :: org.apache.spark#spark-submit-parent-597f3c82-718d-498b-b00e-7928264c307a;1.0
confs: [default]
found io.delta#delta-core_2.12;2.4.0 in central
found io.delta#delta-storage;2.4.0 in central
found org.antlr#antlr4-runtime;4.9.3 in central
:: resolution report :: resolve 111ms :: artifacts dl 5ms
:: modules in use:
io.delta#delta-core_2.12;2.4.0 from central in [default]
io.delta#delta-storage;2.4.0 from central in [default]
org.antlr#antlr4-runtime;4.9.3 from central in [default]
---------------------------------------------------------------------
| | modules || artifacts |
| conf | number| search|dwnlded|evicted|| number|dwnlded|
---------------------------------------------------------------------
| default | 3 | 0 | 0 | 0 || 3 | 0 |
---------------------------------------------------------------------
:: retrieving :: org.apache.spark#spark-submit-parent-597f3c82-718d-498b-b00e-7928264c307a
confs: [default]
0 artifacts copied, 3 already retrieved (0kB/5ms)
CMD>
Note that SUCCESS: The process with PID ... is printed after the CMD> prompt returns, i.e. it is printed by "Windows" (taskkill) after it completes execution of python.
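If a shell redirect is not an option, the same fd-level redirection can be done from Python before the session (and hence the gateway JVM) is created. A sketch, assuming you want everything written to fd 2 (stderr), which is where the Ivy resolution report landed in the err.txt capture above, collected in a file instead of the console (err.txt is just an example name):

import os
import sys

# Redirect OS-level fd 2 (stderr) to a file before the gateway JVM starts.
# Child processes inherit the redirected fd, so whatever they write to
# stderr lands in the file instead of the console; Python's stderr does too.
err_file = open("err.txt", "w")
os.dup2(err_file.fileno(), sys.stderr.fileno())

from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[8]").appName("test").getOrCreate()
spark.createDataFrame([(i,) for i in range(5)], schema="id: int").show()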
For comparison, the same run with python on Linux:
kash@ub$ python test.py
19:15:50.037 [main] WARN org.apache.spark.util.Utils - Your hostname, ub resolves to a loopback address: 127.0.1.1; using 192.168.177.129 instead (on interface ens33)
19:15:50.049 [main] WARN org.apache.spark.util.Utils - Set SPARK_LOCAL_IP if you need to bind to another address
:: loading settings :: url = jar:file:/home/kash/workspaces/spark-log-test/.venv/lib/python3.9/site-packages/pyspark/jars/ivy-2.5.0.jar!/org/apache/ivy/core/settings/ivysettings.xml
Ivy Default Cache set to: /home/kash/.ivy2/cache
The jars for the packages stored in: /home/kash/.ivy2/jars
io.delta#delta-core_2.12 added as a dependency
:: resolving dependencies :: org.apache.spark#spark-submit-parent-7d38e7a2-a0e5-47fa-bfda-2cb5b8b443e0;1.0
confs: [default]
found io.delta#delta-core_2.12;2.4.0 in spark-list
found io.delta#delta-storage;2.4.0 in spark-list
found org.antlr#antlr4-runtime;4.9.3 in spark-list
:: resolution report :: resolve 390ms :: artifacts dl 10ms
:: modules in use:
io.delta#delta-core_2.12;2.4.0 from spark-list in [default]
io.delta#delta-storage;2.4.0 from spark-list in [default]
org.antlr#antlr4-runtime;4.9.3 from spark-list in [default]
---------------------------------------------------------------------
| | modules || artifacts |
| conf | number| search|dwnlded|evicted|| number|dwnlded|
---------------------------------------------------------------------
| default | 3 | 0 | 0 | 0 || 3 | 0 |
---------------------------------------------------------------------
:: retrieving :: org.apache.spark#spark-submit-parent-7d38e7a2-a0e5-47fa-bfda-2cb5b8b443e0
confs: [default]
0 artifacts copied, 3 already retrieved (0kB/19ms)
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
+---+
| id|
+---+
| 0|
| 1|
| 2|
| 3|
| 4|
+---+
kash@ub$
Is there any way I can disable that on Windows 10/11? This part:
SUCCESS: The process with PID 5552 (child process of PID 4668) has been terminated. ...
when you run it with python. – IGRACH
It seems to be coming from java_gateway.py. You can add stdout=PIPE to the Popen call in your local installation and the output of taskkill will be suppressed.
if on_windows:
    # In Windows, the child process here is "spark-submit.cmd", not the JVM itself
    # (because the UNIX "exec" command is not available). This means we cannot simply
    # call proc.kill(), which kills only the "spark-submit.cmd" process but not the
    # JVMs. Instead, we use "taskkill" with the tree-kill option "/t" to terminate all
    # child processes in the tree (http://technet.microsoft.com/en-us/library/bb491009.aspx)
    def killChild():
        Popen(["cmd", "/c", "taskkill", "/f", "/t", "/pid", str(proc.pid)])
Change the last line to:
        Popen(["cmd", "/c", "taskkill", "/f", "/t", "/pid", str(proc.pid)], stdout=PIPE)