Tags: scala, apache-spark, hadoop2, google-cloud-dataproc

How to properly configure the gcs-connector in a local environment


I'm trying to configure the gcs-connector in my Scala project, but I always get java.lang.ClassNotFoundException: Class com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem not found

Here is my project config:

import org.apache.spark.SparkConf
import org.apache.spark.sql.SparkSession

val sparkConf = new SparkConf()
  .set("spark.executor.memory", "4g")
  .set("spark.executor.cores", "2")
  .set("spark.driver.memory", "4g")
  .set("temporaryGcsBucket", "some-bucket")

val spark = SparkSession.builder()
  .config(sparkConf)
  .master("spark://spark-master:7077")
  .getOrCreate()

// Register the gs:// filesystem implementations and service-account credentials
val hadoopConfig = spark.sparkContext.hadoopConfiguration
hadoopConfig.set("fs.gs.auth.service.account.enable", "true")
hadoopConfig.set("fs.gs.auth.service.account.json.keyfile", "./path-to-key-file.json")
hadoopConfig.set("fs.gs.impl", "com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem")
hadoopConfig.set("fs.AbstractFileSystem.gs.impl", "com.google.cloud.hadoop.fs.gcs.GoogleHadoopFS")

I tried to add the gcs-connector to the classpath using both:

.set("spark.jars.packages", "com.google.cloud.bigdataoss:gcs-connector:hadoop2-2.1.6")
.set("spark.driver.extraClassPath", ":/home/celsomarques/Desktop/gcs-connector-hadoop2-2.1.6.jar")

But neither of them loads the specified class onto the classpath.

Could you point out what I'm doing wrong, please?
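
For reference, another way to get the connector onto the classpath is to declare it as a regular build dependency rather than injecting it at runtime; a minimal build.sbt sketch, assuming an sbt build and the same hadoop2-2.1.6 artifact mentioned above:

    // build.sbt (sketch): pull the GCS connector in as an ordinary dependency,
    // so GoogleHadoopFileSystem is on the driver classpath at startup
    libraryDependencies += "com.google.cloud.bigdataoss" % "gcs-connector" % "hadoop2-2.1.6"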


Solution

  • The following config worked:

    val sparkConf = new SparkConf()
      .set("spark.executor.memory", "4g")
      .set("spark.executor.cores", "2")
      .set("spark.driver.memory", "4g")

    val spark = SparkSession.builder()
      .config(sparkConf)
      .master("local")
      .getOrCreate()
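
Putting it together, a sketch of a fully local setup that combines the configuration above with the gs:// filesystem settings from the question might look like the following; it assumes the gcs-connector jar is already on the application classpath (for example via the build dependency sketched earlier), and the key-file path and bucket are placeholders taken from the question:

    import org.apache.spark.SparkConf
    import org.apache.spark.sql.SparkSession

    val sparkConf = new SparkConf()
      .set("spark.executor.memory", "4g")
      .set("spark.executor.cores", "2")
      .set("spark.driver.memory", "4g")

    val spark = SparkSession.builder()
      .config(sparkConf)
      .master("local")
      .getOrCreate()

    // Same gs:// filesystem registration and credentials as in the question
    val hadoopConfig = spark.sparkContext.hadoopConfiguration
    hadoopConfig.set("fs.gs.auth.service.account.enable", "true")
    hadoopConfig.set("fs.gs.auth.service.account.json.keyfile", "./path-to-key-file.json")
    hadoopConfig.set("fs.gs.impl", "com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem")
    hadoopConfig.set("fs.AbstractFileSystem.gs.impl", "com.google.cloud.hadoop.fs.gcs.GoogleHadoopFS")

    // Quick sanity check against the bucket (path is a placeholder)
    spark.read.text("gs://some-bucket/some-file.txt").show()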