I am following this page to install PySpark in Anaconda on Windows 10. In step #6 for validating PySpark, Python could not be found. I found that this answer initially helped me progress to the point of seeing the PySpark banner. Here is my adaptation of the solution in the form of commands issued at the Anaconda prompt (not the Anaconda Powershell prompt):
set PYSPARK_DRIVER_PYTHON=python
set PYSPARK_PYTHON=python
# set PYTHONPATH=C:\Users\<user>\anaconda3\pkgs\pyspark-3.4.0-pyhd8ed1ab_0\site-packages
set PYTHONPATH=c:%HOMEPATH%\anaconda3\pkgs\pyspark-3.2.1-py39haa95532_0\Lib\site-packages
pyspark
As shown above, PYTHONPATH had to be modified to match the folder tree in my own installation. Essentially, I searched for a folder named site-packages under c:%HOMEPATH%\anaconda3\pkgs\pyspark-3.2.1-py39haa95532_0. I assume that the PySpark version was selected by Conda during installation to satisfy package dependencies in the current py39 environment, which contains Python 3.9. I use this Python version for compatibility with others.
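In hindsight, a quicker way to see which pyspark the active environment actually resolves (a check of my own, assuming the conda-installed package is importable from the py39 environment at all) is to ask Python directly at the Conda prompt:
(py39) C:\Users\User.Name> python -c "import pyspark, os; print(os.path.dirname(pyspark.__file__))"
This prints the folder that contains the pyspark package, which, as far as I can tell, is what SPARK_HOME ultimately ends up pointing at.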
PySpark ran for the 1st time after this, but with many, many errors (see Annex below). As I am new to Python, Anaconda, and PySpark, I find the errors to be confusing to say the least. As shown in the Annex, however, I did get the Spark banner and the Python prompt.
As my very first step in troubleshooting the errors, I tried closing and reopening the Conda prompt window. However, the error from this 2nd run of pyspark was different, and equally confusing.
pyspark output from 2nd run:
set PYSPARK_DRIVER_PYTHON=python
set PYSPARK_PYTHON=python
set PYTHONPATH=c:%HOMEPATH%\anaconda3\pkgs\pyspark-3.2.1-py39haa95532_0\Lib\site-packages
pyspark
] was unexpected at this time.
To trace the cause of this different error message, I searched for the file that is executed when I issue pyspark. Here are the candidate files:
where pyspark
C:\Users\User.Name\anaconda3\envs\py39\Scripts\pyspark
C:\Users\User.Name\anaconda3\envs\py39\Scripts\pyspark.cmd
I noted that the 1st script, pyspark, is a Bash script, so it's not surprising that "] was unexpected at this time." I assumed that the 2nd script, pyspark.cmd, is for invocation from Windows's CMD interpreter, of which the Conda prompt is a customization, e.g., by setting certain environment variables. Therefore, I ran pyspark.cmd, but it generated the same error "] was unexpected at this time." Apart from @echo off, the only command in pyspark.cmd is cmd /V /E /C ""%~dp0pyspark2.cmd" %*", which is indecipherable to me.
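As best I can piece together (my own reading, not from any Spark documentation): cmd /V enables delayed variable expansion, /E enables command extensions, /C runs the quoted command and then exits, %~dp0 expands to the drive and directory of the script being run, and %* forwards all of the script's arguments. A toy script (hypothetical, just to illustrate the two expansions):
REM args-demo.cmd -- hypothetical demo script, not part of Spark
@echo off
echo Script directory: %~dp0
echo All arguments:    %*
Running args-demo.cmd a b c prints the folder containing the script followed by a b c, so the one-liner in pyspark.cmd simply re-invokes pyspark2.cmd from the same Scripts folder with the original arguments.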
It seems odd that the Bash script pyspark is set up to run in a Conda environment on Windows. Is this caused by something fundamentally nonsensical about setting the 3 environment variables above prior to running pyspark?
And why would running pyspark.cmd generate the same error as running the Bash script?
I tracked the 2nd error message down to C:\Users\%USERNAME%\anaconda3\envs\py39\Scripts\pyspark2.cmd. It is invoked by pyspark.cmd and also generates the unexpected ] error:
cd C:\Users\%USERNAME%\anaconda3\envs\py39\Scripts
pyspark2.cmd
] was unexpected at this time.
To locate the problematic statement, I manually issued each command in pyspark2.cmd but did not get the same error. Apart from REM statements, here is pyspark2.cmd:
REM `C:\Users\%USERNAME%\anaconda3\envs\py39\Scripts\pyspark2.cmd`
REM -------------------------------------------------------------
@echo off
rem Figure out where the Spark framework is installed
call "%~dp0find-spark-home.cmd"
call "%SPARK_HOME%\bin\load-spark-env.cmd"
set _SPARK_CMD_USAGE=Usage: bin\pyspark.cmd [options]
rem Figure out which Python to use.
if "x%PYSPARK_DRIVER_PYTHON%"=="x" (
set PYSPARK_DRIVER_PYTHON=python
if not [%PYSPARK_PYTHON%] == [] set PYSPARK_DRIVER_PYTHON=%PYSPARK_PYTHON%
)
set PYTHONPATH=%SPARK_HOME%\python;%PYTHONPATH%
set PYTHONPATH=%SPARK_HOME%\python\lib\py4j-0.10.9.3-src.zip;%PYTHONPATH%
set OLD_PYTHONSTARTUP=%PYTHONSTARTUP%
set PYTHONSTARTUP=%SPARK_HOME%\python\pyspark\shell.py
call "%SPARK_HOME%\bin\spark-submit2.cmd" pyspark-shell-main --name "PySparkShell" %*
Here is my palette of the above commands, slightly modified to account for the fact that they are executing at an interactive prompt rather than from within a script file:
REM ~/tmp/tmp.cmd mirrors pyspark2.cmd
REM ----------------------------------
REM Note that %SPARK_HOME%==
REM "c:\Users\%USERNAME%\anaconda3\pkgs\pyspark-3.2.1-py39haa95532_0\Lib\site-packages\pyspark"
cd C:\Users\%USERNAME%\anaconda3\envs\py39\Scripts
call "find-spark-home.cmd"
call "%SPARK_HOME%\bin\load-spark-env.cmd"
set _SPARK_CMD_USAGE=Usage: bin\pyspark.cmd [options]
rem Figure out which Python to use.
REM Manually skipped this cuz %PYSPARK_DRIVER_PYTHON%=="python"
if "x%PYSPARK_DRIVER_PYTHON%"=="x" (
set PYSPARK_DRIVER_PYTHON=python
if not [%PYSPARK_PYTHON%] == [] set PYSPARK_DRIVER_PYTHON=%PYSPARK_PYTHON%
)
REM Manually skipped these two cuz they already prefix %PYTHONPATH%
set PYTHONPATH=%SPARK_HOME%\python;%PYTHONPATH%
set PYTHONPATH=%SPARK_HOME%\python\lib\py4j-0.10.9.3-src.zip;%PYTHONPATH%
set OLD_PYTHONSTARTUP=%PYTHONSTARTUP%
set PYTHONSTARTUP=%SPARK_HOME%\python\pyspark\shell.py
call "%SPARK_HOME%\bin\spark-submit2.cmd" pyspark-shell-main --name "PySparkShell" %*
The final statement above generates the following error:
Error: pyspark does not support any application options.
It is odd that pyspark2.cmd generates the unexpected ] error, while manually running each of its statements generates the above "application options" error instead.
Over the past week, I have sometimes been able to get the Spark prompt shown in the Annex below. Other times, I get the dreaded ] was unexpected at this time.
It doesn't matter whether or not I start from a virgin Anaconda prompt. For both outcomes (Spark prompt vs. "unexpected ]"), the series of commands is:
(base) C:\Users\User.Name> conda activate py39
(py39) C:\Users\User.Name> set PYSPARK_DRIVER_PYTHON=python
(py39) C:\Users\User.Name> set PYSPARK_PYTHON=python
(py39) C:\Users\User.Name> set PYTHONPATH=c:%HOMEPATH%\anaconda3\pkgs\pyspark-3.2.1-py39haa95532_0\Lib\site-packages
(py39) C:\Users\User.Name> pyspark
Due to the unrepeatable outcomes of issuing pyspark, I returned to troubleshooting by issuing each command in each invoked script. Careful bookkeeping was needed to keep track of the arguments %* in each script. The order of invocation is:
pyspark.cmd calls pyspark2.cmd
pyspark2.cmd calls spark-submit2.cmd
spark-submit2.cmd executes java
The final java command is:
(py39) C:\Users\User.Name\anaconda3\envs\py39\Scripts> ^
"%RUNNER%" -Xmx128m ^
-cp "%LAUNCH_CLASSPATH%" org.apache.spark.launcher.Main ^
org.apache.spark.deploy.SparkSubmit pyspark-shell-main ^
--name "PySparkShell" > %LAUNCHER_OUTPUT%
It generates the class-not-found error:
Error: Could not find or load main class org.apache.spark.launcher.Main
Caused by: java.lang.ClassNotFoundException: org.apache.spark.launcher.Main
Here are the environment variables:
%RUNNER% = java
%LAUNCH_CLASSPATH% = c:\Users\User.Name\anaconda3\pkgs\pyspark-3.2.1-py39haa95532_0\Lib\site-packages\pyspark\jars\*
%LAUNCHER_OUTPUT% = C:\Users\User.Name\AppData\Local\Temp\spark-class-launcher-output-22633.txt
The RUNNER variable actually has two trailing spaces, and the quoted "%RUNNER%" invocation causes "java " to be unrecognized, so I removed the quotes.
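A toy session at a plain CMD prompt (my own illustration, not taken from the Spark scripts) shows the effect of those trailing spaces once the value is re-quoted:
> set RUNNER=java  
REM The line above ends with two (invisible) trailing spaces, so the value is "java  ".
> "%RUNNER%" -version
'java  ' is not recognized as an internal or external command,
operable program or batch file.
> %RUNNER% -version
REM This succeeds: unquoted, CMD treats the trailing spaces as mere token separators.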
Annex: pyspark output from 1st run (not 2nd run)
(py39) C:\Users\User.Name>pyspark
Python 3.9.17 (main, Jul 5 2023, 21:22:06) [MSC v.1916 64 bit (AMD64)] on win32
Type "help", "copyright", "credits" or "license" for more information.
WARNING: An illegal reflective access operation has occurred
WARNING: Illegal reflective access by org.apache.spark.unsafe.Platform (file:/C:/Users/User.Name/anaconda3/pkgs/pyspark-3.2.1-py39haa95532_0/Lib/site-packages/pyspark/jars/spark-unsafe_2.12-3.2.1.jar) to constructor java.nio.DirectByteBuffer(long,int)
WARNING: Please consider reporting this to the maintainers of org.apache.spark.unsafe.Platform
WARNING: Use --illegal-access=warn to enable warnings of further illegal reflective access operations
WARNING: All illegal access operations will be denied in a future release
23/07/07 17:49:58 WARN Shell: Did not find winutils.exe: {}
java.io.FileNotFoundException: java.io.FileNotFoundException: HADOOP_HOME and hadoop.home.dir are unset. -see https://wiki.apache.org/hadoop/WindowsProblems
at org.apache.hadoop.util.Shell.fileNotFoundException(Shell.java:548)
at org.apache.hadoop.util.Shell.getHadoopHomeDir(Shell.java:569)
at org.apache.hadoop.util.Shell.getQualifiedBin(Shell.java:592)
at org.apache.hadoop.util.Shell.<clinit>(Shell.java:689)
at org.apache.hadoop.util.StringUtils.<clinit>(StringUtils.java:79)
at org.apache.hadoop.conf.Configuration.getTimeDurationHelper(Configuration.java:1886)
at org.apache.hadoop.conf.Configuration.getTimeDuration(Configuration.java:1846)
at org.apache.hadoop.conf.Configuration.getTimeDuration(Configuration.java:1819)
at org.apache.hadoop.util.ShutdownHookManager.getShutdownTimeout(ShutdownHookManager.java:183)
at org.apache.hadoop.util.ShutdownHookManager$HookEntry.<init>(ShutdownHookManager.java:207)
at org.apache.hadoop.util.ShutdownHookManager.addShutdownHook(ShutdownHookManager.java:304)
at org.apache.spark.util.SparkShutdownHookManager.install(ShutdownHookManager.scala:181)
at org.apache.spark.util.ShutdownHookManager$.shutdownHooks$lzycompute(ShutdownHookManager.scala:50)
at org.apache.spark.util.ShutdownHookManager$.shutdownHooks(ShutdownHookManager.scala:48)
at org.apache.spark.util.ShutdownHookManager$.addShutdownHook(ShutdownHookManager.scala:153)
at org.apache.spark.util.ShutdownHookManager$.<init>(ShutdownHookManager.scala:58)
at org.apache.spark.util.ShutdownHookManager$.<clinit>(ShutdownHookManager.scala)
at org.apache.spark.util.Utils$.createTempDir(Utils.scala:335)
at org.apache.spark.deploy.SparkSubmit.prepareSubmitEnvironment(SparkSubmit.scala:344)
at org.apache.spark.deploy.SparkSubmit.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:898)
at org.apache.spark.deploy.SparkSubmit.doRunMain$1(SparkSubmit.scala:180)
at org.apache.spark.deploy.SparkSubmit.submit(SparkSubmit.scala:203)
at org.apache.spark.deploy.SparkSubmit.doSubmit(SparkSubmit.scala:90)
at org.apache.spark.deploy.SparkSubmit$$anon$2.doSubmit(SparkSubmit.scala:1043)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:1052)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
Caused by: java.io.FileNotFoundException: HADOOP_HOME and hadoop.home.dir are unset.
at org.apache.hadoop.util.Shell.checkHadoopHomeInner(Shell.java:468)
at org.apache.hadoop.util.Shell.checkHadoopHome(Shell.java:439)
at org.apache.hadoop.util.Shell.<clinit>(Shell.java:516)
... 22 more
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
23/07/07 17:50:00 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Welcome to
____ __
/ __/__ ___ _____/ /__
_\ \/ _ \/ _ `/ __/ '_/
/__ / .__/\_,_/_/ /_/\_\ version 3.2.1
/_/
Using Python version 3.9.17 (main, Jul 5 2023 21:22:06)
Spark context Web UI available at http://HOST-NAME:4040
Spark context available as 'sc' (master = local[*], app id = local-1688766602995).
SparkSession available as 'spark'.
>>> 23/07/07 17:50:17 WARN ProcfsMetricsGetter: Exception when trying to compute pagesize, as a result reporting of ProcessTree metrics is stopped
Some of these messages may be innocuous. I found some of them also at this page about installing PySpark in Anaconda (specifically step 4, "Test Spark Installation").
After the passage of time and after switching to another location and Wi-Fi network, I got the following further messages:
23/07/07 19:25:30 WARN Executor: Issue communicating with driver in heartbeater
org.apache.spark.rpc.RpcTimeoutException: Futures timed out after [10000 milliseconds]. This timeout is controlled by spark.executor.heartbeatInterval
at org.apache.spark.rpc.RpcTimeout.org$apache$spark$rpc$RpcTimeout$$createRpcTimeoutException(RpcTimeout.scala:47)
at org.apache.spark.rpc.RpcTimeout$$anonfun$addMessageIfTimeout$1.applyOrElse(RpcTimeout.scala:62)
at org.apache.spark.rpc.RpcTimeout$$anonfun$addMessageIfTimeout$1.applyOrElse(RpcTimeout.scala:58)
at scala.runtime.AbstractPartialFunction.apply(AbstractPartialFunction.scala:38)
at org.apache.spark.rpc.RpcTimeout.awaitResult(RpcTimeout.scala:76)
at org.apache.spark.rpc.RpcEndpointRef.askSync(RpcEndpointRef.scala:103)
at org.apache.spark.executor.Executor.reportHeartBeat(Executor.scala:1005)
at org.apache.spark.executor.Executor.$anonfun$heartbeater$1(Executor.scala:212)
at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)
at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:2019)
at org.apache.spark.Heartbeater$$anon$1.run(Heartbeater.scala:46)
at java.base/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:515)
at java.base/java.util.concurrent.FutureTask.runAndReset(FutureTask.java:305)
at java.base/java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:305)
at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
at java.base/java.lang.Thread.run(Thread.java:829)
Caused by: java.util.concurrent.TimeoutException: Futures timed out after [10000 milliseconds]
at scala.concurrent.impl.Promise$DefaultPromise.ready(Promise.scala:259)
at scala.concurrent.impl.Promise$DefaultPromise.result(Promise.scala:263)
at org.apache.spark.util.ThreadUtils$.awaitResult(ThreadUtils.scala:293)
at org.apache.spark.rpc.RpcTimeout.awaitResult(RpcTimeout.scala:75)
... 12 more
23/07/07 19:25:40 WARN Executor: Issue communicating with driver in heartbeater
org.apache.spark.rpc.RpcTimeoutException: Futures timed out after [10000 milliseconds]. This timeout is controlled by spark.executor.heartbeatInterval
at org.apache.spark.rpc.RpcTimeout.org$apache$spark$rpc$RpcTimeout$$createRpcTimeoutException(RpcTimeout.scala:47)
at org.apache.spark.rpc.RpcTimeout$$anonfun$addMessageIfTimeout$1.applyOrElse(RpcTimeout.scala:62)
at org.apache.spark.rpc.RpcTimeout$$anonfun$addMessageIfTimeout$1.applyOrElse(RpcTimeout.scala:58)
at scala.runtime.AbstractPartialFunction.apply(AbstractPartialFunction.scala:38)
at org.apache.spark.rpc.RpcTimeout.awaitResult(RpcTimeout.scala:76)
at org.apache.spark.rpc.RpcEndpointRef.askSync(RpcEndpointRef.scala:103)
at org.apache.spark.executor.Executor.reportHeartBeat(Executor.scala:1005)
at org.apache.spark.executor.Executor.$anonfun$heartbeater$1(Executor.scala:212)
at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)
at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:2019)
at org.apache.spark.Heartbeater$$anon$1.run(Heartbeater.scala:46)
at java.base/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:515)
at java.base/java.util.concurrent.FutureTask.runAndReset(FutureTask.java:305)
at java.base/java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:305)
at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
at java.base/java.lang.Thread.run(Thread.java:829)
Caused by: java.util.concurrent.TimeoutException: Futures timed out after [10000 milliseconds]
at scala.concurrent.impl.Promise$DefaultPromise.ready(Promise.scala:259)
at scala.concurrent.impl.Promise$DefaultPromise.result(Promise.scala:263)
at org.apache.spark.util.ThreadUtils$.awaitResult(ThreadUtils.scala:293)
at org.apache.spark.rpc.RpcTimeout.awaitResult(RpcTimeout.scala:75)
... 12 more
23/07/07 19:25:50 WARN Executor: Issue communicating with driver in heartbeater
org.apache.spark.rpc.RpcTimeoutException: Futures timed out after [10000 milliseconds]. This timeout is controlled by spark.executor.heartbeatInterval
at org.apache.spark.rpc.RpcTimeout.org$apache$spark$rpc$RpcTimeout$$createRpcTimeoutException(RpcTimeout.scala:47)
at org.apache.spark.rpc.RpcTimeout$$anonfun$addMessageIfTimeout$1.applyOrElse(RpcTimeout.scala:62)
at org.apache.spark.rpc.RpcTimeout$$anonfun$addMessageIfTimeout$1.applyOrElse(RpcTimeout.scala:58)
at scala.runtime.AbstractPartialFunction.apply(AbstractPartialFunction.scala:38)
at org.apache.spark.rpc.RpcTimeout.awaitResult(RpcTimeout.scala:76)
at org.apache.spark.rpc.RpcEndpointRef.askSync(RpcEndpointRef.scala:103)
at org.apache.spark.executor.Executor.reportHeartBeat(Executor.scala:1005)
at org.apache.spark.executor.Executor.$anonfun$heartbeater$1(Executor.scala:212)
at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)
at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:2019)
at org.apache.spark.Heartbeater$$anon$1.run(Heartbeater.scala:46)
at java.base/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:515)
at java.base/java.util.concurrent.FutureTask.runAndReset(FutureTask.java:305)
at java.base/java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:305)
at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
at java.base/java.lang.Thread.run(Thread.java:829)
Caused by: java.util.concurrent.TimeoutException: Futures timed out after [10000 milliseconds]
at scala.concurrent.impl.Promise$DefaultPromise.ready(Promise.scala:259)
at scala.concurrent.impl.Promise$DefaultPromise.result(Promise.scala:263)
at org.apache.spark.util.ThreadUtils$.awaitResult(ThreadUtils.scala:293)
at org.apache.spark.rpc.RpcTimeout.awaitResult(RpcTimeout.scala:75)
... 12 more
23/07/07 19:26:00 WARN Executor: Issue communicating with driver in heartbeater
org.apache.spark.rpc.RpcTimeoutException: Futures timed out after [10000 milliseconds]. This timeout is controlled by spark.executor.heartbeatInterval
at org.apache.spark.rpc.RpcTimeout.org$apache$spark$rpc$RpcTimeout$$createRpcTimeoutException(RpcTimeout.scala:47)
at org.apache.spark.rpc.RpcTimeout$$anonfun$addMessageIfTimeout$1.applyOrElse(RpcTimeout.scala:62)
at org.apache.spark.rpc.RpcTimeout$$anonfun$addMessageIfTimeout$1.applyOrElse(RpcTimeout.scala:58)
at scala.runtime.AbstractPartialFunction.apply(AbstractPartialFunction.scala:38)
at org.apache.spark.rpc.RpcTimeout.awaitResult(RpcTimeout.scala:76)
at org.apache.spark.rpc.RpcEndpointRef.askSync(RpcEndpointRef.scala:103)
at org.apache.spark.executor.Executor.reportHeartBeat(Executor.scala:1005)
at org.apache.spark.executor.Executor.$anonfun$heartbeater$1(Executor.scala:212)
at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)
at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:2019)
at org.apache.spark.Heartbeater$$anon$1.run(Heartbeater.scala:46)
at java.base/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:515)
at java.base/java.util.concurrent.FutureTask.runAndReset(FutureTask.java:305)
at java.base/java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:305)
at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
at java.base/java.lang.Thread.run(Thread.java:829)
Caused by: java.util.concurrent.TimeoutException: Futures timed out after [10000 milliseconds]
at scala.concurrent.impl.Promise$DefaultPromise.ready(Promise.scala:259)
at scala.concurrent.impl.Promise$DefaultPromise.result(Promise.scala:263)
at org.apache.spark.util.ThreadUtils$.awaitResult(ThreadUtils.scala:293)
at org.apache.spark.rpc.RpcTimeout.awaitResult(RpcTimeout.scala:75)
... 12 more
23/07/07 19:26:05 WARN NettyRpcEnv: Ignored message: HeartbeatResponse(false)
23/07/07 19:26:05 WARN NettyRpcEnv: Ignored message: HeartbeatResponse(false)
23/07/07 19:26:05 WARN NettyRpcEnv: Ignored message: HeartbeatResponse(false)
23/07/07 19:26:05 WARN NettyRpcEnv: Ignored message: HeartbeatResponse(false)
FlyingTeller's answer was definitely "a must know" in order to get PySpark working. For the error message that is the subject line of this question, however, the cause is more insidious and obscure (so much so that it's no wonder I haven't found anything about it online).
I found the cause while trying to follow up on FlyingTeller's helpful advice on installing Hadoop and WinUtils. After obtaining these from the GitHub site that he recommended, I had to set the following three variables at the Conda shell prompt, then issue pyspark:
> set PYSPARK_DRIVER_PYTHON=python
> set PYSPARK_PYTHON=python
> set HADOOP_HOME=c:\Users\User.Name\AppData\Local\Hadoop\2.7.1
> pyspark
The 2.7.1 folder above contains the bin folder that contains winutils.exe. I had captured the exact text above in a manual journal text file, including indents and ">" prompts. I used Vim's blockwise Visual mode (Ctrl+V) to select the four lines into the system clipboard, starting from the first "s". Blockwise Visual mode causes the shorter lines to be right-padded with spaces so that they are the same length as the longest (3rd) line. All these trailing spaces become part of the string that is assigned to PYSPARK_DRIVER_PYTHON and PYSPARK_PYTHON above. Somewhere in the nesting of scripts, these trailing spaces result in a series of (likely unintended) invocations that end in the error message "] was unexpected at this time".
The error doesn't occur when the variables are set to the same strings without trailing spaces. It hadn't even occurred to me to consider trailing spaces as the cause of the error, since I'm more used to Bash, where unquoted trailing spaces are not included in variable assignments.
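A minimal reproduction at a plain CMD prompt (my own toy example) shows where the message actually comes from, namely the bracket comparison in pyspark2.cmd once the variable's value contains trailing spaces:
> set PYSPARK_PYTHON=python  
REM The line above ends with two (invisible) trailing spaces.
> if not [%PYSPARK_PYTHON%] == [] set PYSPARK_DRIVER_PYTHON=%PYSPARK_PYTHON%
] was unexpected at this time.
After expansion the test reads if not [python  ] == [] ..., and CMD's parser splits [python  ] at the spaces, leaving a stray ] where it expects a comparison operator. As far as I can tell, the whole if (...) block in pyspark2.cmd is parsed, and its %...% references expanded, before the outer condition is even evaluated, which is why the error appears even though PYSPARK_DRIVER_PYTHON was already set.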
In Vim, use Visual mode: Trailing spaces can be avoided in the selection process by using linewise Visual mode (V) instead of blockwise Visual mode. When selecting multiple lines, however, that would also capture the ">" prompt symbols and the indentation that prefix each line. To avoid this, I would have to capture commands in my journal text file without the ">" prompts. Thus, context is sacrificed, which can be confusing when showing a series of commands across different environments.
In the above code, for example, initial commands issued at the CMD/Conda prompt culminate in the invocation of PySpark, which might be followed by commands issued at a resulting PySpark/Python prompt. Capturing the prompt character with the series of commands to be issued gives a clear indication that the interpreter changes partway through the list of commands.
Copy each command separately, avoiding trailing spaces: Another alternative to using Vim's blockwise Visual mode to copy commands is to Vim-"yank" each line into the system clipboard, from the 1st letter to the end of the line. It is done for each line individually, so it quickly gets tedious.
Use quotes around CMD variable assignments: The best solution that I found was to use quotes around the variable assignments so that trailing spaces are ignored:
> set "PYSPARK_DRIVER_PYTHON=python"
> set "PYSPARK_PYTHON=python"
> set "HADOOP_HOME=c:\Users\User.Name\AppDta\Local\Hadoop\2.7.1"
> pyspark
This way, Vim's blockwise Visual mode can still be used to copy multiple lines into the clipboard while avoiding prompt characters and indentation spaces.
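A quick way to see the difference (again, my own toy example):
REM Unquoted: the trailing spaces become part of the value
> set X=python  
> echo [%X%]
[python  ]
REM Quoted: anything after the closing quote, including trailing spaces, is discarded
> set "X=python"  
> echo [%X%]
[python]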
WARN NativeCodeLoader: Unable to load native-hadoop library
The following links indicate that this is an innocuous warning, with the perplexing caveat that I/O on Windows is only guaranteed to be correct if things are built from scratch. As I am not a developer, I felt that I wasn't prepared to take that on.
Apparently, warnings can be suppressed via log4j, but I didn't explore that.
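For the record, my understanding (untested) is that for Spark 3.2.x this would mean dropping a log4j.properties next to the spark-defaults.conf shown below, with a higher root log level. Something along these lines, adapted from Spark's stock template and to be treated as a sketch only:
# %SPARK_HOME%/conf/log4j.properties  (untested sketch)
# -----------------------------------------------------
log4j.rootCategory=ERROR, console
log4j.appender.console=org.apache.log4j.ConsoleAppender
log4j.appender.console.target=System.err
log4j.appender.console.layout=org.apache.log4j.PatternLayout
log4j.appender.console.layout.ConversionPattern=%d{yy/MM/dd HH:mm:ss} %p %c{1}: %m%n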
The exception for non-existence of "file:/tmp/spark-events"
To solve this, I created the following folders:
I then created the following file:
# %SPARK_HOME%/conf/spark-defaults.conf
#--------------------------------------
spark.eventLog.enabled true
spark.eventLog.dir C:\\Users\\User.Name\\anaconda3\\envs\\py39\\PySparkLogs
spark.history.fs.logDirectory C:\\Users\\User.Name\\anaconda3\\envs\\py39\\PySparkLogs
Sources of information from which the above solution was synthesized:
Note that %SPARK_HOME% was not set at the Conda prompt. I relied on partially successful attempts at getting to the PySpark prompt in the past few weeks, then querying and recording the environment variable SPARK_HOME:
>>> print(os.environ.get("SPARK_HOME"))
C:\Users\User.Name\anaconda3\envs\py39\lib\site-packages\pyspark
I definitely didn't want to be explicitly setting unnecessary variables because it might be incompatible with how package installers would set them. FlyingTeller's correction of my setting of PYTHONPATH made this abundantly clear.
Cannot run program ... winutils.exe: Access denied
While the above eliminated the exception for the non-existence of "file:/tmp/spark-events", it didn't succeed in entering PySpark, apparently due to an error in initializing "SparkContext", in turn due to the inability to run winutils.exe. The cause is access denial, but access to what was unclear:
23/07/28 14:01:12 ERROR SparkContext: Error initializing SparkContext.
java.io.IOException: Cannot run program "C:\Users\User.Name\AppData\Local\Hadoop\2.7.1\bin\winutils.exe": CreateProcess error=5, Access is denied
Using Cygwin's Bash, I found that none of the files c:\Users\User.Name\AppData\Local\Hadoop\2.7.1\bin\*.exe had execute permission, so I fixed that. This eliminated all of the errors and warnings that had me concerned:
# Using Cygwin's Bash
$ chmod u+x /c/Users/User.Name/AppData/Local/Hadoop/2.7.1/bin/*.exe
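As a quick sanity check afterwards (my own extra step, not from any guide), simply invoking the binary from the Conda prompt tells you whether the permission problem is gone; if the ACLs are still wrong it fails with the same "Access is denied", otherwise winutils just prints its usage text:
(py39) C:\Users\User.Name> c:\Users\User.Name\AppData\Local\Hadoop\2.7.1\bin\winutils.exe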
REM At the Conda prompt
(py39) C:\Users\User.Name> set "PYSPARK_DRIVER_PYTHON=python"
(py39) C:\Users\User.Name> set "PYSPARK_PYTHON=python"
(py39) C:\Users\User.Name> set "HADOOP_HOME=c:%HOMEPATH%\AppData\Local\Hadoop\2.7.1"
(py39) C:\Users\User.Name> pyspark
Python 3.9.17 (main, Jul 5 2023, 21:22:06) [MSC v.1916 64 bit (AMD64)] on win32
Type "help", "copyright", "credits" or "license" for more information.
WARNING: An illegal reflective access operation has occurred
WARNING: Illegal reflective access by org.apache.spark.unsafe.Platform (file:/C:/Users/User.Name/anaconda3/envs/py39/Lib/site-packages/pyspark/jars/spark-unsafe_2.12-3.2.1.jar) to constructor java.nio.DirectByteBuffer(long,int)
WARNING: Please consider reporting this to the maintainers of org.apache.spark.unsafe.Platform
WARNING: Use --illegal-access=warn to enable warnings of further illegal reflective access operations
WARNING: All illegal access operations will be denied in a future release
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
23/07/28 15:05:19 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Welcome to
____ __
/ __/__ ___ _____/ /__
_\ \/ _ \/ _ `/ __/ '_/
/__ / .__/\_,_/_/ /_/\_\ version 3.2.1
/_/
Using Python version 3.9.17 (main, Jul 5 2023 21:22:06)
Spark context Web UI available at http://Laptop-Hostname:4040
Spark context available as 'sc' (master = local[*], app id = local-1690571121168).
SparkSession available as 'spark'.
>>> 23/07/28 15:05:35 WARN ProcfsMetricsGetter: Exception when trying to compute pagesize, as a result reporting of ProcessTree metrics is stopped
>>>
ProcfsMetricsGetter: Exception when trying to compute pagesize
This is the final line in the output immediately above. On Windows, it is "expected behaviour" (as they say). It is innocuous, according to Wing Yew Poon (whom I would probably know of if I were a developer), as quoted here in this answer.
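Finally, as a one-line end-to-end check that doesn't require sitting at the PySpark prompt (my own habit, not from any of the guides), the following can be issued at the Conda prompt after the three quoted set commands above; it should print 5 and exit cleanly:
(py39) C:\Users\User.Name> python -c "from pyspark.sql import SparkSession; s = SparkSession.builder.master('local[1]').getOrCreate(); print(s.range(5).count()); s.stop()"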