Tags: spark-streaming, datastax-enterprise

Datastax Enterprise File System (DSEFS): Error while using with Spark Streaming


I enabled the DataStax Enterprise file system (DSEFS) following this link:

https://docs.datastax.com/en/latest-dse/datastax_enterprise/ana/enablingDsefs.html

I am able to use the dse fs shell, and I created a folder /checkpoint.

When I use this folder as the checkpoint directory (dsefs://:5598/checkpoint) in Spark Streaming, I get the following error:

Exception in thread "main" java.io.IOException: No FileSystem for scheme: dsefs
        at org.apache.hadoop.fs.FileSystem.getFileSystemClass(FileSystem.java:2644)
        at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2651)
        at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:92)
        at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2687)
        at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2669)
        at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:371)
        at org.apache.hadoop.fs.Path.getFileSystem(Path.java:295)
        at org.apache.spark.streaming.StreamingContext.checkpoint(StreamingContext.scala:234)
        at org.apache.spark.streaming.api.java.JavaStreamingContext.checkpoint(JavaStreamingContext.scala:577)
        at com.sstech.captiveyes.data.streaming.StreamingVisitClassifierMerge.main(StreamingVisitClassifierMerge.java:96)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:498)
        at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:731)
        at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:181)
        at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:206)
        at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:121)
        at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)

Am I missing a configuration step?


Solution

  • The essential part of Hadoop configuration is:

    <property>
      <name>fs.dsefs.impl</name>
      <value>com.datastax.bdp.fs.hadoop.DseFileSystem</value>
    </property>
    

    Put it in your Hadoop core-site.xml file, or set the same property programmatically on the Hadoop Configuration object.
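    The programmatic route can be sketched in Java like the asker's application. This is a minimal sketch, not DataStax's reference code: the host name, app name, and batch interval are placeholder assumptions you would replace with your own values.

    ```java
    import org.apache.spark.SparkConf;
    import org.apache.spark.api.java.JavaSparkContext;
    import org.apache.spark.streaming.Durations;
    import org.apache.spark.streaming.api.java.JavaStreamingContext;

    public class DsefsCheckpointExample {
        public static void main(String[] args) {
            SparkConf conf = new SparkConf().setAppName("DsefsCheckpointExample");
            JavaSparkContext sc = new JavaSparkContext(conf);

            // Register the DSEFS implementation before anything resolves a
            // dsefs:// path; this mirrors the fs.dsefs.impl entry that would
            // otherwise live in core-site.xml.
            sc.hadoopConfiguration().set("fs.dsefs.impl",
                    "com.datastax.bdp.fs.hadoop.DseFileSystem");

            JavaStreamingContext ssc =
                    new JavaStreamingContext(sc, Durations.seconds(10));
            // "node-hostname" is a placeholder for a node running DSEFS.
            ssc.checkpoint("dsefs://node-hostname:5598/checkpoint");
        }
    }
    ```

    Setting the property before `checkpoint()` is called matters, because the "No FileSystem for scheme: dsefs" exception is thrown the first time Hadoop's FileSystem cache tries to resolve the dsefs scheme.
    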

    If you're running this on a DSE node, the setting is configured automatically in dse-core-default.xml on startup when you enable the Analytics workload, so it should work out of the box with DSE Spark.

    If you're running this on an external Spark cluster, read the Bring Your Own Spark section of the DSE documentation: https://docs.datastax.com/en/latest-dse/datastax_enterprise/spark/byosIntro.html. It describes how to set up your Spark cluster to access not only DSEFS but also Cassandra.