
Starting up Spark History Server to write to minIO


I'm trying to get Spark History Server to run on my cluster that is running on Kubernetes, and I'd like the logs to get written to minIO. I'm also using minIO as storage of the input and output of my spark-submit jobs, which is working already.

Currently working spark-submit jobs

My working spark-submit job looks something like the following:

spark-submit \
  --conf spark.hadoop.fs.s3a.access.key=XXXX \
  --conf spark.hadoop.fs.s3a.secret.key=XXXX \
  --conf spark.hadoop.fs.s3a.endpoint=https://someIpv4 \
  --conf spark.hadoop.fs.s3a.connection.ssl.enabled=true \
  --conf spark.hadoop.fs.s3a.path.style.access=true \
  --conf spark.hadoop.fs.default.name="s3a:///" \
  --conf spark.driver.extraJavaOptions="-Djavax.net.ssl.trustStore=XXXX -Djavax.net.ssl.trustStorePassword=XXXX" \
  --conf spark.executor.extraJavaOptions="-Djavax.net.ssl.trustStore=XXXX -Djavax.net.ssl.trustStorePassword=XXXX" \
...

As you can see, I'm using SSL to connect to minIO and to read/write files.

What I'm trying

I'm trying to spin up the history server with minIO as storage without using SSL.

To start up the history server, I'm using the start-history-server.sh script that ships with Spark, passing a properties file that defines the log storage location: ./start-history-server.sh --properties-file my_conf_file. my_conf_file looks like this:

spark.eventLog.enabled=true
spark.eventLog.dir=s3a://myBucket/spark-events
spark.history.fs.logDirectory=s3a://myBucket/spark-events
spark.hadoop.fs.s3a.access.key=XXXX
spark.hadoop.fs.s3a.secret.key=XXXX
spark.hadoop.fs.s3a.endpoint=http://someIpv4
spark.hadoop.fs.s3a.path.style.access=true
spark.hadoop.fs.s3a.connection.ssl.enabled=false

So you see I'm not adding any SSL parameters. But when I run ./start-history-server.sh --properties-file my_conf_file, I'm getting this error:

INFO AmazonHttpClient: Unable to execute HTTP request: Connection refused (Connection refused)
java.net.ConnectException: Connection refused (Connection refused)
        at java.net.PlainSocketImpl.socketConnect(Native Method)
        at java.net.AbstractPlainSocketImpl.doConnect(AbstractPlainSocketImpl.java:350)
        at java.net.AbstractPlainSocketImpl.connectToAddress(AbstractPlainSocketImpl.java:206)
        at java.net.AbstractPlainSocketImpl.connect(AbstractPlainSocketImpl.java:188)
        at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:392)
        at java.net.Socket.connect(Socket.java:607)
        at org.apache.http.conn.scheme.PlainSocketFactory.connectSocket(PlainSocketFactory.java:121)
        at org.apache.http.impl.conn.DefaultClientConnectionOperator.openConnection(DefaultClientConnectionOperator.java:180)
        at org.apache.http.impl.conn.ManagedClientConnectionImpl.open(ManagedClientConnectionImpl.java:326)
        at org.apache.http.impl.client.DefaultRequestDirector.tryConnect(DefaultRequestDirector.java:610)
        at org.apache.http.impl.client.DefaultRequestDirector.execute(DefaultRequestDirector.java:445)
        at org.apache.http.impl.client.AbstractHttpClient.doExecute(AbstractHttpClient.java:835)
        at org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:83)
        at org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:56)
        at com.amazonaws.http.AmazonHttpClient.executeHelper(AmazonHttpClient.java:384)
        at com.amazonaws.http.AmazonHttpClient.execute(AmazonHttpClient.java:232)
        at com.amazonaws.services.s3.AmazonS3Client.invoke(AmazonS3Client.java:3528)
        at com.amazonaws.services.s3.AmazonS3Client.headBucket(AmazonS3Client.java:1031)
        at com.amazonaws.services.s3.AmazonS3Client.doesBucketExist(AmazonS3Client.java:994)
        at org.apache.hadoop.fs.s3a.S3AFileSystem.initialize(S3AFileSystem.java:297)
        at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2669)
        at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:94)
        at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2703)
        at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2685)
        at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:373)
        at org.apache.hadoop.fs.Path.getFileSystem(Path.java:295)
        at org.apache.spark.deploy.history.FsHistoryProvider.<init>(FsHistoryProvider.scala:117)
        at org.apache.spark.deploy.history.FsHistoryProvider.<init>(FsHistoryProvider.scala:86)
        at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
        at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
        at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
        at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
        at org.apache.spark.deploy.history.HistoryServer$.main(HistoryServer.scala:296)
        at org.apache.spark.deploy.history.HistoryServer.main(HistoryServer.scala)
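Since the error is a plain "Connection refused" coming out of the S3A client, a quick sanity check (just a sketch, assuming curl is available on the node and someIpv4 stands in for the real endpoint address) is to hit the endpoint directly over plain HTTP:

# Probe the MinIO endpoint over plain HTTP.
# Any HTTP response at all (even a 403) means something is listening and speaking HTTP;
# "Connection refused" here points at the endpoint/port/protocol rather than at Spark.
curl -v http://someIpv4/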

What have I tried/found on the internet

  • This person had a very similar problem to mine, but it seems like they solved it using spark.hadoop.fs.s3a.path.style.access, which I'm already using
  • I was able to spin up History server using the local filesystem, so that seems to be working correctly
  • I have seen people, like in this post, using the spark.hadoop.fs.s3a.impl key with org.apache.hadoop.fs.s3a.S3AFileSystem as value. When I do this, however, it seems like this class doesn't exist within my AWS jars (see the sketch after this list).
    • I have the following AWS jars at my disposal: aws-java-sdk-1.7.4.jar and hadoop-aws-2.7.3.jar
    • Since my spark-submit jobs run fine, reading and writing files to minIO without me supplying the spark.hadoop.fs.s3a.impl parameter, I would think that parameter is not needed?
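
As a sanity check for the "class doesn't exist" point above (just a sketch, assuming the jars sit in the current directory), one can list the jar contents and look for the S3A filesystem class, which normally lives in hadoop-aws rather than in the AWS SDK jar:

# Look for org.apache.hadoop.fs.s3a.S3AFileSystem inside the hadoop-aws jar
unzip -l hadoop-aws-2.7.3.jar | grep -i S3AFileSystem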

Does anyone have an idea of where I should be looking/what I'm doing wrong?


Solution

  • My problem was that my minIO deployment did not accept plain HTTP requests. My already working spark-submit job was connecting over HTTPS with SSL, so I added the needed SSL parameters to $SPARK_DAEMON_JAVA_OPTS and the history server started working.
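
For reference, a minimal sketch of what that could look like (the truststore path and password are placeholders, and the endpoint and bucket names are simply the ones from the question):

# In spark-env.sh (or exported before calling start-history-server.sh):
# hand the history server daemon the same truststore the spark-submit jobs use.
export SPARK_DAEMON_JAVA_OPTS="-Djavax.net.ssl.trustStore=/path/to/truststore.jks -Djavax.net.ssl.trustStorePassword=XXXX"

# my_conf_file, now pointing at the HTTPS endpoint with SSL enabled:
spark.history.fs.logDirectory=s3a://myBucket/spark-events
spark.hadoop.fs.s3a.access.key=XXXX
spark.hadoop.fs.s3a.secret.key=XXXX
spark.hadoop.fs.s3a.endpoint=https://someIpv4
spark.hadoop.fs.s3a.path.style.access=true
spark.hadoop.fs.s3a.connection.ssl.enabled=true

./start-history-server.sh --properties-file my_conf_file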