Tags: apache-spark, amazon-s3, hdfs, spark-structured-streaming, spark-kafka-integration

Error while pulling Kafka JKS certificates from HDFS (trying with S3 as well) in Spark


I am running Spark in cluster mode, and it fails with the following error:

ERROR SslEngineBuilder: Modification time of key store could not be obtained: hdfs://ip:port/user/hadoop/jks/kafka.client.truststore.jks
java.nio.file.NoSuchFileException: hdfs:/ip:port/user/hadoop/jks/kafka.client.truststore.jks

I ran the command below and verified that the JKS files are present at that location.

hadoop fs -ls hdfs://ip:port/user/hadoop/<folder1>

I have written the code below to connect to Kafka in my Spark project.

Spark Code:

sparkSession.readStream
.format("kafka")
.option("kafka.bootstrap.servers", )
.option("subscribe", )
...
.option("kafka.ssl.keystore.password", "pswd")
.option("kafka.ssl.key.password", "pswrd")
.option("kafka.ssl.truststore.location", "hdfs://node:port/user/hadoop/<folder1>/kafka.client.truststore.jks")
.option("kafka.ssl.keystore.location", "hdfs://node:port/user/hadoop/<folder1>/kafka.client.keystore.jks")

  1. What is missing here?
  2. How can I achieve the same with the JKS files stored in S3?

Solution

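  • Pass the files with --files s3a://... (or an hdfs:// path) on the spark-submit command line, or set the spark.files option when building the session.

    You can then refer to those files by name only (not by full path), because Spark copies them into the working directory of each executor. The Kafka client opens the key store and trust store through the local filesystem, which is why an hdfs:// location ends in java.nio.file.NoSuchFileException.

    For reading from S3, you also need to define your S3 access keys securely (i.e. not as plaintext in your Spark code). Put them in a Hadoop configuration resource file such as hdfs-site.xml (core-site.xml is the usual home for fs.s3a.* properties).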
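
    As a rough sketch of the above, assuming YARN cluster mode and a job submitted with both stores listed in --files, the reader could look something like the following. The object name, the app name, and taking the bootstrap servers and topic from args are hypothetical choices for illustration; the passwords are the placeholders from the question, and any options elided in the question are left out here as well.

```scala
import org.apache.spark.sql.SparkSession

// Hypothetical object name, for illustration only.
object KafkaSslSketch {
  def main(args: Array[String]): Unit = {
    // Assumption: the job was submitted roughly like this, so that both
    // JKS files are copied into each container's working directory:
    //   spark-submit --deploy-mode cluster \
    //     --files hdfs://ip:port/user/hadoop/<folder1>/kafka.client.truststore.jks,hdfs://ip:port/user/hadoop/<folder1>/kafka.client.keystore.jks \
    //     ...
    val sparkSession = SparkSession.builder()
      .appName("kafka-ssl-sketch") // hypothetical app name
      .getOrCreate()

    // Assumption: bootstrap servers and topic arrive as program arguments.
    val bootstrapServers = args(0)
    val topic = args(1)

    val stream = sparkSession.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", bootstrapServers)
      .option("subscribe", topic)
      // Any other options from the original job (elided in the question) go here.
      .option("kafka.ssl.keystore.password", "pswd")
      .option("kafka.ssl.key.password", "pswrd")
      // Bare file names: resolved against the local working directory,
      // which is where --files / spark.files place the copies.
      .option("kafka.ssl.truststore.location", "kafka.client.truststore.jks")
      .option("kafka.ssl.keystore.location", "kafka.client.keystore.jks")
      .load()

    // Only inspects the schema; attach whatever sink the real job uses.
    stream.printSchema()
  }
}
```

    The only change from the code in the question is that the two location options carry file names instead of hdfs:// URIs. In client mode the driver runs outside the cluster, so it would presumably need its own local copy of the files at a path it can resolve.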