Tags: csv, apache-spark, pyspark

Error from PySpark code pattern for reading all CSV files in a folder


I am spinning up on Python, Spark, and PySpark, all installed with Anaconda on Windows 10. According to this tutorial, I should be able to read all CSV files in a folder into a DataFrame, provided that all the files in the folder are CSV files:

# Create SparkSession object on stand-alone laptop.
# Intel Core i5-1245U has 10 cores.
from pyspark.sql import SparkSession
spark = SparkSession.builder.master("local[10]") \
   .appName("SparkExamples.com").getOrCreate()

df = spark.read.csv( r"C:\cygwin64\home\User.Name\tmp", header=True )

Here, folder C:\cygwin64\home\User.Name\tmp contains only zipcodes1.csv, zipcodes2.csv, and zipcodes3.csv. Each of these is a replica of zipcodes.csv from GitHub. A DataFrame created from one of the files looks like:

>>> spark.read.option("header","true") \
      .csv(r"C:\cygwin64\home\User.Name\tmp\zipcodes1.csv").show(3)
+------------+-------+-----------+-------------------+-----+<...snip...>
|RecordNumber|Zipcode|ZipCodeType|               City|State|<...snip...>
+------------+-------+-----------+-------------------+-----+<...snip...>
|           1|    704|   STANDARD|        PARC PARQUE|   PR|<...snip...>
|           2|    704|   STANDARD|PASEO COSTA DEL SUR|   PR|<...snip...>
|          10|    709|   STANDARD|       BDA SAN LUIS|   PR|<...snip...>
+------------+-------+-----------+-------------------+-----+<...snip...>

While I can read one file, I can't read the whole tmp folder, and the error is not very informative. The Spyder console transcript is:

>>> spark.read.option("header","true") \
   .csv( r"C:\cygwin64\home\User.Name\tmp").show(3)
Traceback (most recent call last):
  Cell In[19], line 1
    spark.read.option("header","true") \
  File ~\anaconda3\envs\py39\lib\site-packages\pyspark\sql\readwriter.py:727 in csv
    return self._df(self._jreader.csv(self._spark._sc._jvm.PythonUtils.toSeq(path)))
  File ~\anaconda3\envs\py39\lib\site-packages\py4j\java_gateway.py:1322 in __call__
    return_value = get_return_value(
  File ~\anaconda3\envs\py39\lib\site-packages\pyspark\errors\exceptions\captured.py:169 in deco
    return f(*a, **kw)
  File ~\anaconda3\envs\py39\lib\site-packages\py4j\protocol.py:326 in get_return_value
    raise Py4JJavaError(
Py4JJavaError: An error occurred while calling o87.csv.
: java.lang.UnsatisfiedLinkError: 'boolean org.apache.hadoop.io.nativeio.NativeIO$Windows.access0(java.lang.String, int)'
   at org.apache.hadoop.io.nativeio.NativeIO$Windows.access0(Native Method)
   at org.apache.hadoop.io.nativeio.NativeIO$Windows.access(NativeIO.java:793)
   <...snip...>
   at py4j.ClientServerConnection.run(ClientServerConnection.java:106)
   at java.base/java.lang.Thread.run(Thread.java:829)

Annex B below provides the full stack trace. It is always the same except for the Py4JJavaError line, which refers to a different o___.csv object each time. The path prefix ~ is Windows's %USERPROFILE%.

Much internet searching reveals that the code pattern should work. What am I doing wrong?


Annex A: Troubleshooting

Troubleshooting #1

I confirmed that I can read zipcodes[123].csv if the paths are specified individually. Through trial and error, I found that I must collect the individual paths into a list (described here):

>>> spark.read \
   .option("header","true") \
   .csv([r"C:\cygwin64\home\User.Name\tmp\zipcodes1.csv",
         r"C:\cygwin64\home\User.Name\tmp\zipcodes2.csv",
         r"C:\cygwin64\home\User.Name\tmp\zipcodes3.csv"]
   ).show(3)
+------------+-------+-----------+-------------------+-----+<...snip...>
|RecordNumber|Zipcode|ZipCodeType|               City|State|<...snip...>
+------------+-------+-----------+-------------------+-----+<...snip...>
|           1|    704|   STANDARD|        PARC PARQUE|   PR|<...snip...>
|           2|    704|   STANDARD|PASEO COSTA DEL SUR|   PR|<...snip...>
|          10|    709|   STANDARD|       BDA SAN LUIS|   PR|<...snip...>
+------------+-------+-----------+-------------------+-----+<...snip...>
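
Since the explicit list of paths works, one workaround (a sketch only; it sidesteps the directory listing that the stack trace shows failing, rather than fixing the underlying error) is to build that list with Python's standard glob module:

import glob

# Collect the individual CSV paths in Python and hand Spark an explicit list,
# which the test above shows does not trigger the error.
csv_paths = glob.glob(r"C:\cygwin64\home\User.Name\tmp\*.csv")

spark.read.option("header", "true").csv(csv_paths).show(3)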

Troubleshooting #2

As a shot in the dark, I also tried to get rid of the header option:

`spark.read.csv( r"C:\cygwin64\home\User.Name\tmp").show(3)`

This generates the same error and stack trace as above.

Troubleshooting #3

Perhaps the path needs to end in a path separator so that Spark knows that it is a directory:

>>> spark.read.csv( r"C:\cygwin64\home\User.Name\tmp\", header=True )
  Cell In[1], line 1
    spark.read.csv( r"C:\cygwin64\home\User.Name\tmp\", header=True )
                                                                   ^
SyntaxError: EOL while scanning string literal

Obviously, Python blew past the closing quote, and it isn't clear why. It can't be due to the preceding backslash, because backslashes are not special in raw strings.

Troubleshooting #4

Being new to Python, I could be wrong about backslashes being unspecial in raw strings, perhaps under conditions I can't yet fathom, e.g., when a backslash is the final character of the string. Therefore, I tried escaping the terminating backslash:

spark.read.csv( r"C:\cygwin64\home\User.Name\tmp\\", header=True )

This generated the same errors and stack trace as above, except that the Py4JJavaError line refers to o56.csv rather than o87.csv.

While this didn't solve the original problem, it did reveal a mystery: why a terminating backslash seems to be special even in a raw string.
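
A quick check in a plain Python REPL (output condensed) makes the surprise concrete: the backslashes are kept literally, yet a raw string still cannot end in a single backslash.

>>> len(r"tmp\\")   # both backslashes are kept: t, m, p, \, \
5
>>> r"tmp\"
SyntaxError: EOL while scanning string literal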

Troubleshooting #5

I found that, indeed, backslashes aren't entirely ordinary even in raw strings: a raw string cannot end in an odd number of backslashes, because the backslash still prevents the quote that follows it from closing the string (see here, here, and here). The right way to specify a path with a terminating backslash is to concatenate string literals:

spark.read.csv( r"C:\cygwin64\home\User.Name\tmp" "\\", header=True )

spark.read \
   .option("header","true") \
   .csv(r"C:\cygwin64\home\User.Name\tmp" "\\") \
   .show(3)

Both statements, however, yield the same error and stack trace as above, except that the Py4JJavaError line refers to o56.csv and o61.csv, respectively.

Troubleshooting #6

To get around any uncertainty arising from how backslashes are interpreted in string literals, I used forward slashes instead:

spark.read.option("header","true") \
   .csv(r"C:/cygwin64/home/User.Name/tmp/")

Again, this yields the same error and stack trace, except that the Py4JJavaError line refers to o48.csv.

Troubleshooting #7

In response to user238607's answer, I tried:

folder_path = "C:/cygwin64/home/User.Name/tmp/"
glob_pattern = "*.csv" # Example: Read only CSV files
data_frame = spark.read.option("pathGlobFilter", glob_pattern) \
  .option("header", True).csv(folder_path)

This yields the same error and stack trace, except that the Py4JJavaError line refers to o48.csv.

Troubleshooting #8

Perhaps it is the drive-letter prefix in the full path (even though that doesn't cause a problem when specifying individual file paths). To check, I copied the folder tmp, which contains only CSV files, into the current working directory, which pwd shows as 'C:\\Users\\User.Name'. Using Cygwin's Bash:

$ cp -R ~/tmp /c/Users/User.Name
$ ls -ld /c/Users/User.Name/tmp

   drwx------+ 1 User.Name None 0 Oct 12 15:21 /c/Users/User.Name/tmp

$ ls -l /c/Users/User.Name/tmp

   -rwx------+ 1 User.Name None 3035 Oct 12 15:21 zipcodes1.csv
   -rwx------+ 1 User.Name None 3035 Oct 12 15:21 zipcodes2.csv
   -rwx------+ 1 User.Name None 3035 Oct 12 15:21 zipcodes3.csv

In Spyder, specify the folder using raw and normal strings:

spark.read.option("header","true").csv(r"tmp/")
spark.read.option("header","true").csv("tmp/")

These two invocations generate the same error and stack trace except that the Py4JJavaError line refers to o65.csv and o70.csv, respectively.

Troubleshooting #9

Thanks again to user238607 for pointing out this possible cause and solution. However, it looks like I already have that base covered in my setup. Essentially, the environment variable HADOOP_HOME must be set to the parent folder of Hadoop's bin folder. I confirmed this in Spyder (after import os):

>>> os.environ.get("HADOOP_HOME")
'C:\\Users\\User.Name\\AppData\\Local\\Hadoop\\2.7.1'

The bin subfolder therein contains many Hadoop-looking files (Annex C) and no further subfolders. As I am new to Python, Spark, and especially Hadoop, is there a simple sanity check to confirm the proper setup? I've already successfully followed SparkByExamples tutorials in which RDD objects are created.
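
The closest thing I have come up with myself is the rough sketch below. It only verifies that HADOOP_HOME is set, that winutils.exe and hadoop.dll exist under its bin folder, and that this bin folder is on PATH (which is where the JVM typically resolves native libraries on Windows); it does not check that hadoop.dll actually matches the Hadoop build bundled with PySpark, which may be the real question.

import os

hadoop_home = os.environ.get("HADOOP_HOME", "")
bin_dir = os.path.join(hadoop_home, "bin")
print("HADOOP_HOME:", hadoop_home or "<not set>")

# winutils.exe and hadoop.dll are the native pieces Spark needs on Windows.
for name in ("winutils.exe", "hadoop.dll"):
    print(name, "found:", os.path.isfile(os.path.join(bin_dir, name)))

# The JVM looks up hadoop.dll via java.library.path, which on Windows is
# derived from PATH, so the bin folder should appear there.
path_dirs = [os.path.normcase(p.rstrip("\\"))
             for p in os.environ.get("PATH", "").split(os.pathsep)]
print("bin on PATH:", os.path.normcase(bin_dir) in path_dirs)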

Annex B: Full stack trace

The full stack trace is always the same except for the Py4JJavaError line, which refers to a different o___.csv object each time.

Traceback (most recent call last):
  Cell In[18], line 1
    spark.read \
  File ~\anaconda3\envs\py39\lib\site-packages\pyspark\sql\readwriter.py:727 in csv
    return self._df(self._jreader.csv(self._spark._sc._jvm.PythonUtils.toSeq(path)))
  File ~\anaconda3\envs\py39\lib\site-packages\py4j\java_gateway.py:1322 in __call__
    return_value = get_return_value(
  File ~\anaconda3\envs\py39\lib\site-packages\pyspark\errors\exceptions\captured.py:169 in deco
    return f(*a, **kw)
  File ~\anaconda3\envs\py39\lib\site-packages\py4j\protocol.py:326 in get_return_value
    raise Py4JJavaError(
Py4JJavaError: An error occurred while calling o65.csv.
: java.lang.UnsatisfiedLinkError: 'boolean org.apache.hadoop.io.nativeio.NativeIO$Windows.access0(java.lang.String, int)'
    at org.apache.hadoop.io.nativeio.NativeIO$Windows.access0(Native Method)
    at org.apache.hadoop.io.nativeio.NativeIO$Windows.access(NativeIO.java:793)
    at org.apache.hadoop.fs.FileUtil.canRead(FileUtil.java:1249)
    at org.apache.hadoop.fs.FileUtil.list(FileUtil.java:1454)
    at org.apache.hadoop.fs.RawLocalFileSystem.listStatus(RawLocalFileSystem.java:601)
    at org.apache.hadoop.fs.FileSystem.listStatus(FileSystem.java:1972)
    at org.apache.hadoop.fs.FileSystem.listStatus(FileSystem.java:2014)
    at org.apache.hadoop.fs.ChecksumFileSystem.listStatus(ChecksumFileSystem.java:761)
    at org.apache.spark.util.HadoopFSUtils$.listLeafFiles(HadoopFSUtils.scala:225)
    at org.apache.spark.util.HadoopFSUtils$.$anonfun$parallelListLeafFilesInternal$1(HadoopFSUtils.scala:95)
    at scala.collection.TraversableLike.$anonfun$map$1(TraversableLike.scala:286)
    at scala.collection.mutable.ResizableArray.foreach(ResizableArray.scala:62)
    at scala.collection.mutable.ResizableArray.foreach$(ResizableArray.scala:55)
    at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:49)
    at scala.collection.TraversableLike.map(TraversableLike.scala:286)
    at scala.collection.TraversableLike.map$(TraversableLike.scala:279)
    at scala.collection.AbstractTraversable.map(Traversable.scala:108)
    at org.apache.spark.util.HadoopFSUtils$.parallelListLeafFilesInternal(HadoopFSUtils.scala:85)
    at org.apache.spark.util.HadoopFSUtils$.parallelListLeafFiles(HadoopFSUtils.scala:69)
    at org.apache.spark.sql.execution.datasources.InMemoryFileIndex$.bulkListLeafFiles(InMemoryFileIndex.scala:162)
    at org.apache.spark.sql.execution.datasources.InMemoryFileIndex.listLeafFiles(InMemoryFileIndex.scala:133)
    at org.apache.spark.sql.execution.datasources.InMemoryFileIndex.refresh0(InMemoryFileIndex.scala:96)
    at org.apache.spark.sql.execution.datasources.InMemoryFileIndex.<init>(InMemoryFileIndex.scala:68)
    at org.apache.spark.sql.execution.datasources.DataSource.createInMemoryFileIndex(DataSource.scala:539)
    at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:405)
    at org.apache.spark.sql.DataFrameReader.loadV1Source(DataFrameReader.scala:229)
    at org.apache.spark.sql.DataFrameReader.$anonfun$load$2(DataFrameReader.scala:211)
    at scala.Option.getOrElse(Option.scala:189)
    at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:211)
    at org.apache.spark.sql.DataFrameReader.csv(DataFrameReader.scala:538)
    at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.base/java.lang.reflect.Method.invoke(Method.java:566)
    at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
    at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:374)
    at py4j.Gateway.invoke(Gateway.java:282)
    at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
    at py4j.commands.CallCommand.execute(CallCommand.java:79)
    at py4j.ClientServerConnection.waitForCommands(ClientServerConnection.java:182)
    at py4j.ClientServerConnection.run(ClientServerConnection.java:106)
    at java.base/java.lang.Thread.run(Thread.java:829)

Annex C: Files in %USERPROFILE%\AppData\Local\Hadoop\2.7.1\bin

OnOutOfMemory.cmd
Start-HadoopAdminShell.cmd
Start-HadoopAdminShell.ps1
datanode.exe
datanode.xml
gplcompression.dll
hadoop
hadoop.cmd
hadoop.dll
hadoop.exp
hadoop.lib
hdfs
hdfs.cmd
hdfs.dll
hdfs.exp
hdfs.lib
hdfs_static.lib
historyserver.exe
historyserver.xml
kill-name-node
kill-secondary-name-node
libwinutils.lib
lzo2.dll
mapred
mapred.cmd
namenode.exe
namenode.xml
nodemanager.exe
nodemanager.xml
rcc
resourcemanager.exe
resourcemanager.xml
secondarynamenode.exe
secondarynamenode.xml
snappy-c.obj
snappy-sinksource.obj
snappy-stubs-internal.obj
snappy.dll
snappy.dll.intermediate.manifest
snappy.exp
snappy.lastbuildstate
snappy.lib
snappy.obj
snappy.write.1.tlog
timelineserver.exe
timelineserver.xml
winutils.exe
yarn
yarn.cmd

Solution

  • Thanks to user238607's comment, the solution was simply to copy hadoop.dll to C:\Windows\System32.

    @user238607: This answer really belongs to you, so feel free to post it and I'll take down mine. Though with your rep, I doubt that you're that concerned about it. I'm interested in providing an answer promptly so that the community bot doesn't see this Q&A as useless and remove it. Thanks!
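
For completeness, after copying hadoop.dll into C:\Windows\System32 (putting %HADOOP_HOME%\bin on PATH is the other commonly suggested way of making the DLL visible to the JVM) and restarting the Spark session, the original one-liner reads the folder:

# Re-test after the fix, in a fresh Spark session; same folder as in the question.
spark.read.option("header", "true") \
    .csv(r"C:\cygwin64\home\User.Name\tmp").show(3)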