Tags: apache-spark, kubernetes, nfs, delta-lake

Unable to create file using Spark in client mode


I have Spark 3.1.2 running in client mode on K8s (with 8 workers). I set up NFS storage to update a Delta table stored on it. Spark is throwing the following error:

java.io.IOException: Cannot create file:/spark-nfs/v_data/delta/table_1/_delta_log
 at org.apache.spark.sql.delta.DeltaLog.ensureLogDirectoryExist(DeltaLog.scala:290)

The code that throws the error is:

df.write.partitionBy("Cod").format('delta').save(path="/spark-nfs/v_data/delta/table_1/", mode='overwrite')

My spark config is:

self.conf = {
            "spark.network.timeout": 36000000,
            "spark.executor.heartbeatInterval": 36000000,
            "spark.storage.blockManagerSlaveTimeoutMs": 36000000,
            "spark.driver.maxResultSize": "30g",
            "spark.sql.session.timeZone": "UTC",
            "spark.driver.extraJavaOptions": "-Duser.timezone=GMT",
            "spark.executor.extraJavaOptions": "-Duser.timezone=GMT",
            "spark.driver.host": pod_ip,
            "spark.driver.memory": executor_memory,
            "spark.memory.offHeap.enabled": True,
            "spark.memory.offHeap.size": executor_memory,
            "spark.sql.legacy.parquet.int96RebaseModeInRead" : "CORRECTED",
            "spark.sql.legacy.parquet.int96RebaseModeInWrite" : "CORRECTED",
            "spark.sql.legacy.parquet.datetimeRebaseModeInRead" : "CORRECTED",
            "spark.sql.legacy.parquet.datetimeRebaseModeInWrite" : "CORRECTED",
            "fs.permissions.umask-mode": "777"
        }

I'm using io.delta:delta-core_2.12:1.0.0.
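
For completeness, here is roughly how the session is built from that dictionary (a minimal sketch: the app name, master URL, and memory value are illustrative assumptions, not the exact job code; the Delta extension and catalog settings follow the delta-core 1.0.0 documentation):

from pyspark.sql import SparkSession

# Sketch: apply a conf dict like the one above while building the session.
conf = {
    "spark.driver.memory": "30g",   # ...plus the rest of the dictionary shown above
}

builder = (
    SparkSession.builder
    .appName("delta-nfs-writer")                       # assumed app name
    .master("k8s://https://kubernetes.default.svc")    # assumed K8s API server address
    .config("spark.jars.packages", "io.delta:delta-core_2.12:1.0.0")
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
    .config("spark.sql.catalog.spark_catalog",
            "org.apache.spark.sql.delta.catalog.DeltaCatalog")
)
for key, value in conf.items():
    builder = builder.config(key, str(value))

spark = builder.getOrCreate()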

So, since I'm granting full permissions, why can't Spark create the _delta_log directory?

NOTE: Only the _delta_log directory is not created; the parquet files are created normally within the table directory.


Solution

  • Basically, since I'm running Spark in client mode, the node that calls Spark when a job starts (in my case the Airflow worker, since I launch jobs via Airflow) acts as the Spark driver. The _delta_log directory is created by the driver, while the parquet data files are written by the executors, which is why only the log directory failed. So every node (the Spark workers and the Airflow worker) has to mount the NFS share at the same path; I had mounted it only on the Spark workers.
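
A rough sketch of the settings that cover the executor side (the NFS server address, export path, and volume name below are placeholders; Spark 3.1+ can mount nfs volumes into executor pods on Kubernetes):

# Sketch: mount the same NFS share on every executor pod.
# Server address, export path, and volume name are placeholders.
nfs_conf = {
    "spark.kubernetes.executor.volumes.nfs.spark-nfs.mount.path": "/spark-nfs",
    "spark.kubernetes.executor.volumes.nfs.spark-nfs.mount.readOnly": "false",
    "spark.kubernetes.executor.volumes.nfs.spark-nfs.options.server": "nfs-server.example.internal",
    "spark.kubernetes.executor.volumes.nfs.spark-nfs.options.path": "/export/spark-nfs",
}

These settings do not touch the driver: in client mode the driver is the process that submits the job (here, the Airflow worker), so /spark-nfs also has to be mounted there at the same path through the worker's own pod spec or an OS-level mount.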