Tags: java, scala, apache-spark, pyspark, jupyter-notebook

Clustered Spark fails to write _delta_log via a Notebook without granting the Notebook data access?


TLDR: Why does my Spark cluster fail to complete writes to a Delta table unless my Jupyter Notebook has access to the data location, contrary to my expectation that Spark should handle writes independently of Jupyter's data access?

I've set up a PySpark Jupyter Notebook connected to a Spark cluster, where the Spark instance is intended to perform writes to a Delta table. However, I'm observing that the Spark instance fails to complete the writes if the Jupyter Notebook doesn't have access to the data location. Repo for reproducibility. Specific PR that reproduces the bug.

Setup:

version: '3'
services:
  spark:
    image: com/data_lake_spark:latest
    # Spark service configuration details...

  spark-worker-1:
    # Configuration details...

  spark-worker-2:
    # Configuration details...

  jupyter:
    image: com/data_lake_notebook:latest
    # Jupyter Notebook service configuration details...

Spark Session Configuration:

# Spark session setup...
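The actual session setup is elided above; below is a minimal sketch of what it roughly looks like, assuming the standalone master is reachable at spark://spark:7077 and that the delta-spark pip package is installed in the notebook image (the service name, port, and app name are assumptions, not taken from the repo):

from delta import configure_spark_with_delta_pip
from pyspark.sql import SparkSession

# Point the session at the standalone master and enable Delta Lake.
builder = (
    SparkSession.builder
    .appName("data_lake_notebook")
    .master("spark://spark:7077")
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
    .config("spark.sql.catalog.spark_catalog",
            "org.apache.spark.sql.delta.catalog.DeltaCatalog")
)

# configure_spark_with_delta_pip adds the matching Delta jars via spark.jars.packages.
spark = configure_spark_with_delta_pip(builder).getOrCreate()

# The driver for this session lives in the Jupyter kernel (client mode),
# which turns out to matter for the error below.
print(spark.sparkContext.master, spark.conf.get("spark.submit.deployMode", "client"))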

Commanding Code:

# Write initial test data to Delta table
owner_df.write.format("delta").mode("overwrite").save(delta_output_path)
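For reference, the DataFrame and target path used above look roughly like this; the columns and rows are made up for illustration, while the path matches the one in the error message:

# Illustrative only: the real owner_df is built in the repo's notebook.
owner_df = spark.createDataFrame(
    [("Alice", "Fido"), ("Bob", "Rex")],
    ["owner", "dog"],
)

# A local path bind-mounted into the containers as /data.
delta_output_path = "/data/delta_table_of_dog_owners"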

Removing Jupyter's access to the /data directory in the Docker Compose configuration results in a DeltaIOException when attempting to write to the Delta table. However, providing access to the /data directory allows successful writes.

Error Message:

Py4JJavaError: An error occurred while calling o56.save.
: org.apache.spark.sql.delta.DeltaIOException: [DELTA_CANNOT_CREATE_LOG_PATH] Cannot create file:/data/delta_table_of_dog_owners/_delta_log
    at org.apache.spark.sql.delta.DeltaErrorsBase.cannotCreateLogPathException(DeltaErrors.scala:1534)
    at org.apache.spark.sql.delta.DeltaErrorsBase.cannotCreateLogPathException$(DeltaErrors.scala:1533)
    at org.apache.spark.sql.delta.DeltaErrors$.cannotCreateLogPathException(DeltaErrors.scala:3203)
    at org.apache.spark.sql.delta.DeltaLog.createDirIfNotExists$1(DeltaLog.scala:443)

I expected Spark to handle the writes independently of Jupyter's data access. Any insights or suggestions for resolving this issue would be appreciated.


Solution

  • For the collective benefit:

    The assumption was that PySpark was just a thin front end and that the writes happened on the cluster, independently of the notebook. The reality was that PySpark was running in client (driver) mode: the driver process runs inside the Jupyter kernel itself and drives the application. That is what allows the more integrated things a job submitted to the cluster can't do, such as live debugging, as you would in a Jupyter Notebook. And because the save path is a plain file:/ path, the _delta_log directory is created by the driver on its local filesystem, which in this setup is the Jupyter container.

    Now, there's a hard-coded exception that's the next piece of the puzzle: Exception in thread "main" org.apache.spark.SparkException: Cluster deploy mode is currently not supported for python applications on standalone clusters.

    So, against a standalone cluster, you literally cannot run a PySpark application in any mode other than client mode.

    How do we solve this? Well, this is why interfaces to the cluster like Livy basically exist, and why Databricks' alternative connectors probably got so popular. Without granting PySpark access to the target location, your only option is to submit a .py file to the cluster (the interactive session itself has to run in client mode); a sketch of that follows below.
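
    A minimal sketch of that workaround, under the assumption that the Spark containers (but not the notebook) have /data mounted: move the write into a plain .py job and run it with spark-submit from the master container, e.g. docker compose exec spark spark-submit --master spark://spark:7077 --packages io.delta:delta-spark_2.12:3.1.0 /opt/jobs/write_owners.py (the service name, file location, and Delta/Scala versions are assumptions and must match your Spark build). The driver then runs inside the Spark container, which can see /data, so the notebook no longer needs the mount:

    # write_owners.py -- submitted with spark-submit, so the driver (and hence the
    # code that creates _delta_log) runs on a container that has /data mounted.
    from pyspark.sql import SparkSession

    spark = (
        SparkSession.builder
        .appName("write_owners_job")
        # Delta jars come from --packages on the spark-submit command line.
        .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
        .config("spark.sql.catalog.spark_catalog",
                "org.apache.spark.sql.delta.catalog.DeltaCatalog")
        .getOrCreate()
    )

    # Illustrative data; in practice this would be the real owner_df pipeline.
    owner_df = spark.createDataFrame([("Alice", "Fido"), ("Bob", "Rex")], ["owner", "dog"])

    owner_df.write.format("delta").mode("overwrite").save("/data/delta_table_of_dog_owners")

    spark.stop()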