Tags: scala, apache-spark, amazon-emr

Spark worker nodes unable to access file on master node


I am trying to connect to Presto DB from my Spark Scala code, running on an EMR cluster. I am able to create the RDD, but when the worker nodes try to fetch the data the code fails with a file-not-found error (the keystore does not exist), even though the file is present on the master node. Is there a way to copy the keystore file to the child nodes? Below are my code and the steps I am following.

As a first step, I copy the certificate to the /tmp folder using the command below:

s3-dist-cp --src s3://test/rootca_ca.jks --dest /tmp/

Then I run the code below with the following spark-submit command:

spark-submit --executor-memory=10G --driver-memory=10G --executor-cores=2 --jars s3://test1/jars/presto-jdbc-338-e.0.jar  --class com.asurion.prestotest --master yarn s3://test1/script/prestotest.jar 


package com.asurion

import org.apache.spark._
import org.apache.spark.SparkContext._
import org.apache.log4j._
import org.apache.spark.sql.SQLContext
import org.apache.spark.sql.functions._
import org.apache.spark.sql.types._
import org.apache.spark.sql.SparkSession
import org.apache.spark.storage.StorageLevel
import java.time.LocalDateTime
import java.util.concurrent._

object prestotest {
  def main(args: Array[String]) {
    Logger.getLogger("org").setLevel(Level.OFF)
    Logger.getLogger("akka").setLevel(Level.OFF)

    val conf = new SparkConf().setAppName("testapp")
    val sc = new SparkContext(conf)
    val sqlcontext = SparkSession.builder().getOrCreate()

    val carrier_info = " select * from test_tbl  "

    val enrdata = sqlcontext.read.format("jdbc")
      .option("url", "jdbc:presto://test.atlas.prd.aws.test.net:18443/hive")
      .option("SSL", "true")
      .option("SSLTrustStorePath", "/tmp/rootca_ca.jks")
      .option("SSLTrustStorePassword", "pass1")
      .option("query", carrier_info)
      .option("user", "user1")
      .option("password", "pass2")
      .option("driver", "io.prestosql.jdbc.PrestoDriver")
      .load()

    println("Writing Statistics")
    enrdata.show(5)

    println("Writing done")
  }
}

Error:

scheduler.TaskSetManager (Logging.scala:logWarning(66)): Lost task 0.0 in stage 0.0 (TID 0, 100.64.187.253, executor 1): java.sql.SQLException: Error setting up SSL: /tmp/rootca_ca.jks (No such file or directory)
    at io.prestosql.jdbc.PrestoDriverUri.setupClient(PrestoDriverUri.java:235)
    at io.prestosql.jdbc.PrestoDriver.connect(PrestoDriver.java:88)
    at org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$$anonfun$createConnectionFactory$1.apply(JdbcUtils.scala:63)
    at org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$$anonfun$createConnectionFactory$1.apply(JdbcUtils.scala:54)
    at org.apache.spark.sql.execution.datasources.jdbc.JDBCRDD.compute(JDBCRDD.scala:272)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
    at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
    at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
    at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
    at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
    at org.apache.spark.scheduler.Task.run(Task.scala:121)
    at org.apache.spark.executor.Executor$TaskRunner$$anonfun$10.apply(Executor.scala:408)
    at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1360)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:414)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:748)
Caused by: io.prestosql.jdbc.$internal.client.ClientException: Error setting up SSL: /tmp/rootca_ca.jks (No such file or directory)
    at io.prestosql.jdbc.$internal.client.OkHttpUtil.setupSsl(OkHttpUtil.java:241)
    at io.prestosql.jdbc.PrestoDriverUri.setupClient(PrestoDriverUri.java:203)
    ... 23 more
Caused by: java.io.FileNotFoundException: /tmp/rootca_ca.jks (No such file or directory)
    at java.io.FileInputStream.open0(Native Method)
    at java.io.FileInputStream.open(FileInputStream.java:195)
    at java.io.FileInputStream.<init>(FileInputStream.java:138)
    at io.prestosql.jdbc.$internal.client.OkHttpUtil.loadTrustStore(OkHttpUtil.java:308)
    at io.prestosql.jdbc.$internal.client.OkHttpUtil.setupSsl(OkHttpUtil.java:220)

Solution

  • Spark on EMR creates the driver on one of the CORE nodes (by default), not on the master node.

    A driver running on a CORE node cannot access files that exist only on the master node. (A quick way to confirm where the driver runs is shown in the first sketch after this list.)

    So what options do you have?

    1. While spinning up the EMR cluster, write a bootstrap script that copies the rootca_ca.jks file (s3 cp) to every worker (CORE & TASK) node; then you don't need to change anything in the program.
    2. Since you are using s3-dist-cp to copy the file, it lands in HDFS, not on the Linux file system.
      To access it you need to add the file-system prefix: hdfs:///tmp/rootca_ca.jks (see the second sketch after this list).
      You don't need to include the name-node address and port, because EMR configures them in core-site.xml.
    3. As your file is already in S3, you can use EMRFS (which is simply S3 exposed as a Hadoop file system). To access a file through EMRFS, just use its S3 URL: s3://test/rootca_ca.jks.
      But make sure your EMR EC2 IAM role has permission to get the object from S3.
      EMRFS also comes with its own cost, but as your file is small you can leverage it.

    • Option 1 is very hard to maintain over time: if the file needs to change on a running cluster, you have to update it manually on every worker node.

    • Option 2: since HDFS is a shared file system, you only need to maintain the file in one place.

    • Option 3 is much the same as option 2, except that the file stays in S3 and is read from there; you don't have to copy it into HDFS.
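As a quick check of the first point above, you can log the hostname from inside the driver. This is a hypothetical diagnostic, not part of the original program:

// Hypothetical diagnostic (not in the question's code): print which host the
// driver is running on. With --master yarn in the default cluster deploy mode
// on EMR, this typically shows a CORE-node hostname, not the master's.
println(s"Driver host: ${java.net.InetAddress.getLocalHost.getHostName}")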
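For options 2 and 3, the only change to the question's code is the SSLTrustStorePath value. A minimal sketch, assuming the prefixed paths resolve as described above; every other option stays exactly as in the original read:

// Same JDBC read as in the question; only SSLTrustStorePath changes.
val enrdata = sqlcontext.read.format("jdbc")
  .option("url", "jdbc:presto://test.atlas.prd.aws.test.net:18443/hive")
  .option("SSL", "true")
  .option("SSLTrustStorePath", "hdfs:///tmp/rootca_ca.jks")  // option 2: HDFS copy
  // .option("SSLTrustStorePath", "s3://test/rootca_ca.jks") // option 3: EMRFS
  .option("SSLTrustStorePassword", "pass1")
  .option("query", carrier_info)
  .option("user", "user1")
  .option("password", "pass2")
  .option("driver", "io.prestosql.jdbc.PrestoDriver")
  .load()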