I enabled the DataStax Enterprise file system (DSEFS) following this guide:
https://docs.datastax.com/en/latest-dse/datastax_enterprise/ana/enablingDsefs.html
I am able to use the DSEFS shell, and I created a folder /checkpoint.
When I use this folder as the checkpoint directory (dsefs://:5598/checkpoint) for Spark Streaming, I get the following error:
Exception in thread "main" java.io.IOException: No FileSystem for scheme: dsefs
at org.apache.hadoop.fs.FileSystem.getFileSystemClass(FileSystem.java:2644)
at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2651)
at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:92)
at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2687)
at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2669)
at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:371)
at org.apache.hadoop.fs.Path.getFileSystem(Path.java:295)
at org.apache.spark.streaming.StreamingContext.checkpoint(StreamingContext.scala:234)
at org.apache.spark.streaming.api.java.JavaStreamingContext.checkpoint(JavaStreamingContext.scala:577)
at com.sstech.captiveyes.data.streaming.StreamingVisitClassifierMerge.main(StreamingVisitClassifierMerge.java:96)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:731)
at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:181)
at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:206)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:121)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
Am I missing some configuration step?
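For reference, the checkpoint is set roughly like this (a simplified sketch of my code; the batch interval is illustrative, and the empty host in the URI matches what I used above):

import org.apache.spark.SparkConf;
import org.apache.spark.streaming.Durations;
import org.apache.spark.streaming.api.java.JavaStreamingContext;

public class StreamingVisitClassifierMerge {
    public static void main(String[] args) {
        SparkConf conf = new SparkConf().setAppName("StreamingVisitClassifierMerge");
        JavaStreamingContext jssc = new JavaStreamingContext(conf, Durations.seconds(1));
        // This call fails: Hadoop cannot resolve a FileSystem for the "dsefs" scheme
        jssc.checkpoint("dsefs://:5598/checkpoint");
    }
}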
The essential part of the Hadoop configuration is:
<property>
  <name>fs.dsefs.impl</name>
  <value>com.datastax.bdp.fs.hadoop.DseFileSystem</value>
</property>
Put it in your Hadoop core-site.xml file, or set the property programmatically on the Hadoop Configuration object.
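For example, setting it programmatically from a Java Spark Streaming app (a minimal sketch; only the property name and value are specific to DSEFS, the app name and batch interval are illustrative):

import org.apache.spark.SparkConf;
import org.apache.spark.streaming.Durations;
import org.apache.spark.streaming.api.java.JavaStreamingContext;

public class DsefsCheckpointExample {
    public static void main(String[] args) {
        SparkConf conf = new SparkConf().setAppName("dsefs-checkpoint");
        JavaStreamingContext jssc = new JavaStreamingContext(conf, Durations.seconds(1));
        // Register the DSEFS implementation before the checkpoint path is resolved
        jssc.sparkContext().hadoopConfiguration()
            .set("fs.dsefs.impl", "com.datastax.bdp.fs.hadoop.DseFileSystem");
        jssc.checkpoint("dsefs://:5598/checkpoint");
    }
}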
If you're running this on a DSE node, this setting is configured automatically in dse-core-default.xml on startup when you enable the Analytics workload, so it should work out of the box with DSE Spark.
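In that case, submit the job with DSE's launcher, which puts the DSEFS classes and configuration on the classpath (the jar name here is a placeholder; the class name is taken from your stack trace):

dse spark-submit \
  --class com.sstech.captiveyes.data.streaming.StreamingVisitClassifierMerge \
  my-streaming-app.jar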
If you're running this on an external Spark cluster, read the Bring Your Own Spark (BYOS) section of the DSE documentation: https://docs.datastax.com/en/latest-dse/datastax_enterprise/spark/byosIntro.html. It describes how to set up your Spark cluster to access not only DSEFS but also Cassandra.
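On an external cluster you can also pass the property at submit time through Spark's spark.hadoop.* passthrough, which copies it into the Hadoop Configuration (a sketch: the DSEFS client classes must still be on the classpath via the BYOS jar, whose path below is a placeholder):

spark-submit \
  --conf spark.hadoop.fs.dsefs.impl=com.datastax.bdp.fs.hadoop.DseFileSystem \
  --jars /path/to/dse-byos.jar \
  --class com.sstech.captiveyes.data.streaming.StreamingVisitClassifierMerge \
  my-streaming-app.jar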