Tags: apache-spark, pyspark, hadoop-yarn, google-cloud-dataproc, spark-submit

Can run code in pyspark shell but the same code fails when submitted with spark-submit


I am a Spark amateur, as you will notice from the question. I am trying to run very basic code on a Spark cluster (created on Dataproc).

  1. I SSH into the master.
  2. Create a pyspark shell with pyspark --master yarn and run the code - success.
  3. Run the exact same code with spark-submit --master yarn code.py - fails.

I have provided some basic details below. Please let me know what additional details I can provide to help you help me.

Details:

Code to be run:

testing_dep.py

import pyspark
from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()
sc = spark.sparkContext
rdd = sc.parallelize(range(100),numSlices=10).collect()
print(rdd)

Running with pyspark shell

pyspark --master yarn

Output:

Python 3.9.7 | packaged by conda-forge | (default, Sep 29 2021, 19:20:46) 
[GCC 9.4.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
2022-01-07 20:45:54,608 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
2022-01-07 20:45:58,195 WARN shortcircuit.DomainSocketFactory: The short-circuit local reads feature cannot be used because libhadoop cannot be loaded.
2022-01-07 20:45:58,357 WARN yarn.Client: Neither spark.yarn.jars nor spark.yarn.archive is set, falling back to uploading libraries under SPARK_HOME.
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /__ / .__/\_,_/_/ /_/\_\   version 3.2.0
      /_/
Using Python version 3.9.7 (default, Sep 29 2021 19:20:46)
Spark context Web UI available at http://pyspark32-m.us-central1-b.c.monsoon-credittech.internal:4040
Spark context available as 'sc' (master = yarn, app id = application_1641410203571_0040).
SparkSession available as 'spark'.
>>> import pyspark
>>> from pyspark.sql import SparkSession
>>> spark = SparkSession.builder.getOrCreate()
>>> sc = spark.sparkContext
>>> rdd = sc.parallelize(range(100),numSlices=10).collect()
>>> print(rdd)                                                                  
[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88, 89, 90, 91, 92, 93, 94, 95, 96, 97, 98, 99]

Running with spark-submit

spark-submit --master yarn gs://monsoon-credittech.appspot.com/testing_dep.py

Output:

2022-01-07 20:48:37,310 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Exception in thread "main" java.lang.RuntimeException: java.lang.ClassNotFoundException: Class com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem not found
        at org.apache.hadoop.conf.Configuration.getClass(Configuration.java:2667)
        at org.apache.hadoop.fs.FileSystem.getFileSystemClass(FileSystem.java:3431)
        at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:3466)
        at org.apache.hadoop.fs.FileSystem.access$300(FileSystem.java:174)
        at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:3574)
        at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:3521)
        at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:540)
        at org.apache.spark.util.Utils$.getHadoopFileSystem(Utils.scala:1938)
        at org.apache.spark.util.Utils$.doFetchFile(Utils.scala:780)
        at org.apache.spark.util.DependencyUtils$.downloadFile(DependencyUtils.scala:264)
        at org.apache.spark.deploy.SparkSubmit.$anonfun$prepareSubmitEnvironment$8(SparkSubmit.scala:376)
        at scala.Option.map(Option.scala:230)
        at org.apache.spark.deploy.SparkSubmit.prepareSubmitEnvironment(SparkSubmit.scala:376)
        at org.apache.spark.deploy.SparkSubmit.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:898)
        at org.apache.spark.deploy.SparkSubmit.doRunMain$1(SparkSubmit.scala:180)
        at org.apache.spark.deploy.SparkSubmit.submit(SparkSubmit.scala:203)
        at org.apache.spark.deploy.SparkSubmit.doSubmit(SparkSubmit.scala:90)
        at org.apache.spark.deploy.SparkSubmit$$anon$2.doSubmit(SparkSubmit.scala:1043)
        at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:1052)
        at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
Caused by: java.lang.ClassNotFoundException: Class com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem not found
        at org.apache.hadoop.conf.Configuration.getClassByName(Configuration.java:2571)
        at org.apache.hadoop.conf.Configuration.getClass(Configuration.java:2665)
        ... 19 more

Solution

  • I think the error message is clear:

    Class com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem not found

    The pyspark shell run never touches the gs:// path, but spark-submit first has to download testing_dep.py from Cloud Storage (see DependencyUtils.downloadFile in the stack trace), and the class that implements the gs:// filesystem is not on its classpath. You need to add the JAR file which contains the above class to SPARK_CLASSPATH; a hedged sketch follows below.

    Please see Issues Google Cloud Storage connector on Spark or DataProc for complete solutions.
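    As a concrete illustration (not the only way to do it), the sketch below puts the GCS connector JAR on the classpath before resubmitting. The JAR path is an assumption: on Dataproc images the connector commonly lives under /usr/local/share/google/dataproc/lib/, but locate gcs-connector*.jar on your own master node first.

    # Hedged sketch: make the GCS connector visible to spark-submit.
    # The JAR path is an assumption -- find the real one on your master, e.g.:
    #   ls /usr/local/share/google/dataproc/lib/gcs-connector*.jar
    export SPARK_CLASSPATH=/usr/local/share/google/dataproc/lib/gcs-connector.jar
    spark-submit --master yarn gs://monsoon-credittech.appspot.com/testing_dep.py

    # Equivalent using spark-submit's own flag instead of the (deprecated) env var:
    spark-submit \
      --master yarn \
      --driver-class-path /usr/local/share/google/dataproc/lib/gcs-connector.jar \
      gs://monsoon-credittech.appspot.com/testing_dep.py

    Either way, the point is the same: the JVM that spark-submit launches must be able to load com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem before it can fetch your script from gs://.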