
pyspark checkpoint fails on local machine


I've just started learning PySpark, running standalone on my local machine, and I can't get checkpointing to work. I boiled the script down to this:

spark = SparkSession.builder.appName("PyTest").master("local[*]").getOrCreate()

spark.sparkContext.setCheckpointDir("/RddCheckPoint")
df = spark.createDataFrame(["10","11","13"], "string").toDF("age")
df.checkpoint()

and I get this output...

>>> spark = SparkSession.builder.appName("PyTest").master("local[*]").getOrCreate()
>>>
>>> spark.sparkContext.setCheckpointDir("/RddCheckPoint")
>>> df = spark.createDataFrame(["10","11","13"], "string").toDF("age")
>>> df.checkpoint()
20/01/24 15:26:45 WARN SizeEstimator: Failed to check whether UseCompressedOops is set; assuming yes
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "N:\spark\python\pyspark\sql\dataframe.py", line 463, in checkpoint
    jdf = self._jdf.checkpoint(eager)
  File "N:\spark\python\lib\py4j-0.10.8.1-src.zip\py4j\java_gateway.py", line 1286, in __call__
  File "N:\spark\python\pyspark\sql\utils.py", line 98, in deco
    return f(*a, **kw)
  File "N:\spark\python\lib\py4j-0.10.8.1-src.zip\py4j\protocol.py", line 328, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling o71.checkpoint.
: java.lang.UnsatisfiedLinkError: org.apache.hadoop.io.nativeio.NativeIO$Windows.access0(Ljava/lang/String;I)Z
        at org.apache.hadoop.io.nativeio.NativeIO$Windows.access0(Native Method)
        at org.apache.hadoop.io.nativeio.NativeIO$Windows.access(NativeIO.java:645)
        at org.apache.hadoop.fs.FileUtil.canRead(FileUtil.java:1230)
        at org.apache.hadoop.fs.FileUtil.list(FileUtil.java:1435)
        at org.apache.hadoop.fs.RawLocalFileSystem.listStatus(RawLocalFileSystem.java:493)
        at org.apache.hadoop.fs.FileSystem.listStatus(FileSystem.java:1868)
        at org.apache.hadoop.fs.FileSystem.listStatus(FileSystem.java:1910)
        at org.apache.hadoop.fs.ChecksumFileSystem.listStatus(ChecksumFileSystem.java:678)
        at org.apache.spark.rdd.ReliableCheckpointRDD.getPartitions(ReliableCheckpointRDD.scala:74)
        at org.apache.spark.rdd.RDD.$anonfun$partitions$2(RDD.scala:276)
        at scala.Option.getOrElse(Option.scala:189)
        at org.apache.spark.rdd.RDD.partitions(RDD.scala:272)
        at org.apache.spark.rdd.ReliableCheckpointRDD$.writeRDDToCheckpointDirectory(ReliableCheckpointRDD.scala:179)
        at org.apache.spark.rdd.ReliableRDDCheckpointData.doCheckpoint(ReliableRDDCheckpointData.scala:59)
        at org.apache.spark.rdd.RDDCheckpointData.checkpoint(RDDCheckpointData.scala:75)
        at org.apache.spark.rdd.RDD.$anonfun$doCheckpoint$1(RDD.scala:1801)
        at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)
        at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
        at org.apache.spark.rdd.RDD.doCheckpoint(RDD.scala:1791)
        at org.apache.spark.SparkContext.runJob(SparkContext.scala:2118)
        at org.apache.spark.SparkContext.runJob(SparkContext.scala:2137)
        at org.apache.spark.SparkContext.runJob(SparkContext.scala:2156)
        at org.apache.spark.SparkContext.runJob(SparkContext.scala:2181)
        at org.apache.spark.rdd.RDD.count(RDD.scala:1227)
        at org.apache.spark.sql.Dataset.$anonfun$checkpoint$1(Dataset.scala:689)
        at org.apache.spark.sql.Dataset.$anonfun$withAction$1(Dataset.scala:3472)
        at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$4(SQLExecution.scala:100)
        at org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:160)
        at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:87)
        at org.apache.spark.sql.Dataset.withAction(Dataset.scala:3468)
        at org.apache.spark.sql.Dataset.checkpoint(Dataset.scala:680)
        at org.apache.spark.sql.Dataset.checkpoint(Dataset.scala:643)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(Unknown Source)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(Unknown Source)
        at java.lang.reflect.Method.invoke(Unknown Source)
        at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
        at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
        at py4j.Gateway.invoke(Gateway.java:282)
        at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
        at py4j.commands.CallCommand.execute(CallCommand.java:79)
        at py4j.GatewayConnection.run(GatewayConnection.java:238)
        at java.lang.Thread.run(Unknown Source)

The error doesn't give any specifics about why it failed. I suspect I've missed some Spark config, but I'm not sure what.


Solution

  • You get this error because either the checkpoint directory was never created, or you don't have permission to write to it (your checkpoint directory sits directly under the root directory "/"). Creating the directory yourself and pointing Spark at a relative path (i.e. under the current working directory) avoids both problems:

    from pyspark.sql import SparkSession
    import os
    
    # Create the checkpoint directory relative to the current working
    # directory; exist_ok=True makes this safe to run more than once.
    os.makedirs("RddCheckPoint", exist_ok=True)
    
    spark = SparkSession.builder.appName("PyTest").master("local[*]").getOrCreate()
    spark.sparkContext.setCheckpointDir("RddCheckPoint")
    
    df = spark.createDataFrame(["10", "11", "13"], "string").toDF("age")
    # checkpoint() returns a new, checkpointed DataFrame rather than
    # modifying df in place, so capture the result.
    df = df.checkpoint()
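If you'd rather not depend on whatever the current working directory happens to be, a stdlib-only variation (the `checkpoint_dir` name is just illustrative) builds the path under the system temp directory, which the current user can always write to:

```python
import os
import tempfile

# Build the checkpoint path in a location the current user can write to,
# instead of the filesystem root ("/RddCheckPoint").
checkpoint_dir = os.path.join(tempfile.gettempdir(), "RddCheckPoint")

# exist_ok=True avoids the FileExistsError that os.mkdir() raises when
# the directory is left over from a previous run.
os.makedirs(checkpoint_dir, exist_ok=True)

# spark.sparkContext.setCheckpointDir(checkpoint_dir) would then point at
# a directory that is guaranteed to exist and be writable.
print(checkpoint_dir)
```

Note that files under the temp directory may be cleaned up by the OS, so this is fine for local experiments but not for checkpoints you need to survive across sessions.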