I am spinning up on Python, Spark, and PySpark, all installed with
Anaconda on Windows 10. According to this tutorial, I should be able
to read all CSV files in a folder into a DataFrame, provided that all
the files in the folder are CSV files:
# Create SparkSession object on stand-alone laptop.
# Intel Core i5-1245U has 10 cores.
from pyspark.sql import SparkSession
spark = SparkSession.builder.master("local[10]") \
.appName("SparkExamples.com").getOrCreate()
df = spark.read.csv( r"C:\cygwin64\home\User.Name\tmp", header=True )
Here, folder C:\cygwin64\home\User.Name\tmp contains only
zipcodes1.csv, zipcodes2.csv, and zipcodes3.csv. Each of these is a
replica of zipcodes.csv from GitHub. A DataFrame created from one of
the files looks like:
>>> spark.read.option("header","true") \
.csv(r"C:\cygwin64\home\User.Name\tmp\zipcodes1.csv").show(3)
+------------+-------+-----------+-------------------+-----+<...snip...>
|RecordNumber|Zipcode|ZipCodeType| City|State|<...snip...>
+------------+-------+-----------+-------------------+-----+<...snip...>
| 1| 704| STANDARD| PARC PARQUE| PR|<...snip...>
| 2| 704| STANDARD|PASEO COSTA DEL SUR| PR|<...snip...>
| 10| 709| STANDARD| BDA SAN LUIS| PR|<...snip...>
+------------+-------+-----------+-------------------+-----+<...snip...>
While I can read one file, I can't read the whole tmp folder, and the
error is nondescript. The Spyder console transcript is:
>>> spark.read.option("header","true") \
.csv( r"C:\cygwin64\home\User.Name\tmp").show(3)
Traceback (most recent call last):
Cell In[19], line 1
spark.read.option("header","true") \
File ~\anaconda3\envs\py39\lib\site-packages\pyspark\sql\readwriter.py:727 in csv
return self._df(self._jreader.csv(self._spark._sc._jvm.PythonUtils.toSeq(path)))
File ~\anaconda3\envs\py39\lib\site-packages\py4j\java_gateway.py:1322 in __call__
return_value = get_return_value(
File ~\anaconda3\envs\py39\lib\site-packages\pyspark\errors\exceptions\captured.py:169 in deco
return f(*a, **kw)
File ~\anaconda3\envs\py39\lib\site-packages\py4j\protocol.py:326 in get_return_value
raise Py4JJavaError(
Py4JJavaError: An error occurred while calling o87.csv.
: java.lang.UnsatisfiedLinkError: 'boolean org.apache.hadoop.io.nativeio.NativeIO$Windows.access0(java.lang.String, int)'
at org.apache.hadoop.io.nativeio.NativeIO$Windows.access0(Native Method)
at org.apache.hadoop.io.nativeio.NativeIO$Windows.access(NativeIO.java:793)
<...snip...>
at py4j.ClientServerConnection.run(ClientServerConnection.java:106)
at java.base/java.lang.Thread.run(Thread.java:829)
Annex B below provides the full stack trace. It is always the same
except for the Py4JJavaError line, which specifies a different
o___.csv each time. The path prefix ~ is Windows's %USERPROFILE%.

Much internet searching suggests that this code pattern should work.
What am I doing wrong?
Troubleshooting #1
I confirmed that I can read zipcodes[123].csv if the paths are
specified individually. From trial and error, I found that I must
collect the individual paths into a list (described here):
>>> spark.read \
.option("header","true") \
.csv([r"C:\cygwin64\home\User.Name\tmp\zipcodes1.csv",
r"C:\cygwin64\home\User.Name\tmp\zipcodes2.csv",
r"C:\cygwin64\home\User.Name\tmp\zipcodes3.csv"]
).show(3)
+------------+-------+-----------+-------------------+-----+<...snip...>
|RecordNumber|Zipcode|ZipCodeType| City|State|<...snip...>
+------------+-------+-----------+-------------------+-----+<...snip...>
| 1| 704| STANDARD| PARC PARQUE| PR|<...snip...>
| 2| 704| STANDARD|PASEO COSTA DEL SUR| PR|<...snip...>
| 10| 709| STANDARD| BDA SAN LUIS| PR|<...snip...>
+------------+-------+-----------+-------------------+-----+<...snip...>
Troubleshooting #2
As a shot in the dark, I also tried removing the header option:
spark.read.csv( r"C:\cygwin64\home\User.Name\tmp").show(3)
This generates the same error and stack trace as above.
Troubleshooting #3
Perhaps the path needs to end in a path separator so that Spark knows that it is a directory:
>>> spark.read.csv( r"C:\cygwin64\home\User.Name\tmp\", header=True )
Cell In[1], line 1
spark.read.csv( r"C:\cygwin64\home\User.Name\tmp\", header=True )
^
SyntaxError: EOL while scanning string literal
Obviously, Python blew past the closing quote. It isn't clear why: it
shouldn't be due to the preceding backslash, since backslashes are not
special in raw strings.
Troubleshooting #4
Being new to Python, I allowed that I could be wrong about backslashes
never being special in raw strings, perhaps under conditions I
couldn't yet fathom, e.g., when a backslash is the final character of
the string. Therefore, I tried escaping the terminating backslash:
spark.read.csv( r"C:\cygwin64\home\User.Name\tmp\\", header=True )
This generated the same error and stack trace as above, except that
the Py4JJavaError line refers to o56.csv rather than o87.csv.

While this didn't solve the original problem, it did reveal a mystery
to me: why a terminating backslash seems to be special in a raw
string.
Troubleshooting #5
I found that, indeed, backslashes aren't entirely ordinary in raw
strings: a raw string literal cannot end in a single (or any odd
number of) backslashes (see here, here, and here). The right way to
specify a path with a terminating backslash is to concatenate adjacent
string literals:
spark.read.csv( r"C:\cygwin64\home\User.Name\tmp" "\\", header=True )
spark.read \
.option("header","true") \
.csv(r"C:\cygwin64\home\User.Name\tmp" "\\") \
.show(3)
Both of these statements yield the same error and stack trace as
above, except that the Py4JJavaError line refers to o56.csv and
o61.csv, respectively.
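To convince myself that the concatenation produces the intended
trailing backslash, here is a quick pure-Python check (my own sketch,
independent of Spark; the folder path is just the one from above):
>>> folder = r"C:\cygwin64\home\User.Name\tmp" "\\"
>>> print(folder)
C:\cygwin64\home\User.Name\tmp\
>>> folder.endswith("\\") and not folder.endswith("\\\\")
True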
Troubleshooting #6
To get around any uncertainty arising from how backslashes are interpreted in string literals, I used forward slashes instead:
spark.read.option("header","true") \
.csv(r"C:/cygwin64/home/User.Name/tmp/")
Again, this yields the same error and stack trace, except that the
Py4JJavaError line refers to o48.csv.
Troubleshooting #7
In response to user238607's answer, I tried:
folder_path = "C:/cygwin64/home/User.Name/tmp/"
glob_pattern = "*.csv" # Example: Read only CSV files
data_frame = spark.read.option("pathGlobFilter", glob_pattern) \
.option("header", True).csv(folder_path)
This yields the same error and stack trace, except that the
Py4JJavaError line refers to o48.csv.
Troubleshooting #8
Perhaps it is the drive-letter prefix in the full path (even though
that doesn't cause a problem when specifying individual file paths).
To check, I copied the folder tmp, containing only CSV files, into the
current directory, which pwd shows as 'C:\\Users\\User.Name'. Using
Cygwin's Bash:
$ cp -R ~/tmp /c/Users/User.Name
$ ls -ld /c/Users/User.Name/tmp
drwx------+ 1 User.Name None 0 Oct 12 15:21 /c/Users/User.Name/tmp
$ ls -l /c/Users/User.Name/tmp
-rwx------+ 1 User.Name None 3035 Oct 12 15:21 zipcodes1.csv
-rwx------+ 1 User.Name None 3035 Oct 12 15:21 zipcodes2.csv
-rwx------+ 1 User.Name None 3035 Oct 12 15:21 zipcodes3.csv
In Spyder, I specified the folder using both a raw and a normal string:
spark.read.option("header","true").csv(r"tmp/")
spark.read.option("header","true").csv("tmp/")
These two invocations generate the same error and stack trace, except
that the Py4JJavaError line refers to o65.csv and o70.csv,
respectively.
Troubleshooting #9
Thanks again to user238607 for pointing out this possible cause and
solution. However, it looks like I have those bases covered in my
computing setup. Essentially, environment variable HADOOP_HOME must be
set to the parent folder of Hadoop's bin folder. I confirmed this in
Spyder:
>>> os.environ.get("HADOOP_HOME")
'C:\\Users\\User.Name\\AppData\\Local\\Hadoop\\2.7.1'
The bin subfolder therein contains many Hadoop-looking files (Annex C)
and no further subfolders. As I am new to Python, Spark, and
especially Hadoop, is there a simple sanity check to confirm the
proper setup? I've already successfully followed SparkByExamples
tutorials in which RDD objects are created.
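In the meantime, the closest thing I have to a sanity check is the
snippet below (my own sketch): it only confirms that HADOOP_HOME is
set and that winutils.exe and hadoop.dll are present in its bin
folder, not that they actually load.
import os

hadoop_home = os.environ.get("HADOOP_HOME")
print("HADOOP_HOME =", hadoop_home)
if hadoop_home:
    bin_dir = os.path.join(hadoop_home, "bin")
    # Spark on Windows relies on these two native helpers from Hadoop's bin folder.
    for name in ("winutils.exe", "hadoop.dll"):
        path = os.path.join(bin_dir, name)
        print(path, "->", "found" if os.path.isfile(path) else "MISSING")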
Annex B: Full stack trace
The full stack trace is always the same except for the Py4JJavaError
line, which specifies a different o___.csv each time.
Traceback (most recent call last):
Cell In[18], line 1
spark.read \
File ~\anaconda3\envs\py39\lib\site-packages\pyspark\sql\readwriter.py:727 in csv
return self._df(self._jreader.csv(self._spark._sc._jvm.PythonUtils.toSeq(path)))
File ~\anaconda3\envs\py39\lib\site-packages\py4j\java_gateway.py:1322 in __call__
return_value = get_return_value(
File ~\anaconda3\envs\py39\lib\site-packages\pyspark\errors\exceptions\captured.py:169 in deco
return f(*a, **kw)
File ~\anaconda3\envs\py39\lib\site-packages\py4j\protocol.py:326 in get_return_value
raise Py4JJavaError(
Py4JJavaError: An error occurred while calling o65.csv.
: java.lang.UnsatisfiedLinkError: 'boolean org.apache.hadoop.io.nativeio.NativeIO$Windows.access0(java.lang.String, int)'
at org.apache.hadoop.io.nativeio.NativeIO$Windows.access0(Native Method)
at org.apache.hadoop.io.nativeio.NativeIO$Windows.access(NativeIO.java:793)
at org.apache.hadoop.fs.FileUtil.canRead(FileUtil.java:1249)
at org.apache.hadoop.fs.FileUtil.list(FileUtil.java:1454)
at org.apache.hadoop.fs.RawLocalFileSystem.listStatus(RawLocalFileSystem.java:601)
at org.apache.hadoop.fs.FileSystem.listStatus(FileSystem.java:1972)
at org.apache.hadoop.fs.FileSystem.listStatus(FileSystem.java:2014)
at org.apache.hadoop.fs.ChecksumFileSystem.listStatus(ChecksumFileSystem.java:761)
at org.apache.spark.util.HadoopFSUtils$.listLeafFiles(HadoopFSUtils.scala:225)
at org.apache.spark.util.HadoopFSUtils$.$anonfun$parallelListLeafFilesInternal$1(HadoopFSUtils.scala:95)
at scala.collection.TraversableLike.$anonfun$map$1(TraversableLike.scala:286)
at scala.collection.mutable.ResizableArray.foreach(ResizableArray.scala:62)
at scala.collection.mutable.ResizableArray.foreach$(ResizableArray.scala:55)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:49)
at scala.collection.TraversableLike.map(TraversableLike.scala:286)
at scala.collection.TraversableLike.map$(TraversableLike.scala:279)
at scala.collection.AbstractTraversable.map(Traversable.scala:108)
at org.apache.spark.util.HadoopFSUtils$.parallelListLeafFilesInternal(HadoopFSUtils.scala:85)
at org.apache.spark.util.HadoopFSUtils$.parallelListLeafFiles(HadoopFSUtils.scala:69)
at org.apache.spark.sql.execution.datasources.InMemoryFileIndex$.bulkListLeafFiles(InMemoryFileIndex.scala:162)
at org.apache.spark.sql.execution.datasources.InMemoryFileIndex.listLeafFiles(InMemoryFileIndex.scala:133)
at org.apache.spark.sql.execution.datasources.InMemoryFileIndex.refresh0(InMemoryFileIndex.scala:96)
at org.apache.spark.sql.execution.datasources.InMemoryFileIndex.<init>(InMemoryFileIndex.scala:68)
at org.apache.spark.sql.execution.datasources.DataSource.createInMemoryFileIndex(DataSource.scala:539)
at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:405)
at org.apache.spark.sql.DataFrameReader.loadV1Source(DataFrameReader.scala:229)
at org.apache.spark.sql.DataFrameReader.$anonfun$load$2(DataFrameReader.scala:211)
at scala.Option.getOrElse(Option.scala:189)
at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:211)
at org.apache.spark.sql.DataFrameReader.csv(DataFrameReader.scala:538)
at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.base/java.lang.reflect.Method.invoke(Method.java:566)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:374)
at py4j.Gateway.invoke(Gateway.java:282)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.ClientServerConnection.waitForCommands(ClientServerConnection.java:182)
at py4j.ClientServerConnection.run(ClientServerConnection.java:106)
at java.base/java.lang.Thread.run(Thread.java:829)
Annex C: Contents of Hadoop's bin folder
OnOutOfMemory.cmd
Start-HadoopAdminShell.cmd
Start-HadoopAdminShell.ps1
datanode.exe
datanode.xml
gplcompression.dll
hadoop
hadoop.cmd
hadoop.dll
hadoop.exp
hadoop.lib
hdfs
hdfs.cmd
hdfs.dll
hdfs.exp
hdfs.lib
hdfs_static.lib
historyserver.exe
historyserver.xml
kill-name-node
kill-secondary-name-node
libwinutils.lib
lzo2.dll
mapred
mapred.cmd
namenode.exe
namenode.xml
nodemanager.exe
nodemanager.xml
rcc
resourcemanager.exe
resourcemanager.xml
secondarynamenode.exe
secondarynamenode.xml
snappy-c.obj
snappy-sinksource.obj
snappy-stubs-internal.obj
snappy.dll
snappy.dll.intermediate.manifest
snappy.exp
snappy.lastbuildstate
snappy.lib
snappy.obj
snappy.write.1.tlog
timelineserver.exe
timelineserver.xml
winutils.exe
yarn
yarn.cmd
Thanks to user238607's comment, the solution was simply to copy
hadoop.dll to C:\Windows\System32.
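For anyone who prefers to script the fix, the copy can also be done
from Python, assuming an elevated (administrator) console; copying the
file by hand works just as well:
import os
import shutil

# Copy hadoop.dll from %HADOOP_HOME%\bin into C:\Windows\System32.
# Run this from an elevated (administrator) console, or the copy will
# fail with a permission error.
src = os.path.join(os.environ["HADOOP_HOME"], "bin", "hadoop.dll")
shutil.copy2(src, r"C:\Windows\System32")
print("Copied", src, "to C:\\Windows\\System32")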
@user238607: This answer really belongs to you, so feel free to post it and I'll take down mine. Though with your rep, I doubt that you're that concerned about it. I'm interested in providing an answer promptly so that the community bot doesn't see this Q&A as useless and remove it. Thanks!