
Apache Spark (Structured Streaming): S3 checkpoint support


From the Spark Structured Streaming documentation: "This checkpoint location has to be a path in an HDFS compatible file system, and can be set as an option in the DataStreamWriter when starting a query."

And sure enough, setting the checkpoint location to an S3 path throws:

17/01/31 21:23:56 ERROR ApplicationMaster: User class threw exception: java.lang.IllegalArgumentException: Wrong FS: s3://xxxx/fact_checkpoints/metadata, expected: hdfs://xxxx:8020 
java.lang.IllegalArgumentException: Wrong FS: s3://xxxx/fact_checkpoints/metadata, expected: hdfs://xxxx:8020 
        at org.apache.hadoop.fs.FileSystem.checkPath(FileSystem.java:652) 
        at org.apache.hadoop.hdfs.DistributedFileSystem.getPathName(DistributedFileSystem.java:194) 
        at org.apache.hadoop.hdfs.DistributedFileSystem.access$000(DistributedFileSystem.java:106) 
        at org.apache.hadoop.hdfs.DistributedFileSystem$22.doCall(DistributedFileSystem.java:1305) 
        at org.apache.hadoop.hdfs.DistributedFileSystem$22.doCall(DistributedFileSystem.java:1301) 
        at org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81) 
        at org.apache.hadoop.hdfs.DistributedFileSystem.getFileStatus(DistributedFileSystem.java:1301) 
        at org.apache.hadoop.fs.FileSystem.exists(FileSystem.java:1430) 
        at org.apache.spark.sql.execution.streaming.StreamMetadata$.read(StreamMetadata.scala:51) 
        at org.apache.spark.sql.execution.streaming.StreamExecution.<init>(StreamExecution.scala:100) 
        at org.apache.spark.sql.streaming.StreamingQueryManager.createQuery(StreamingQueryManager.scala:232) 
        at org.apache.spark.sql.streaming.StreamingQueryManager.startQuery(StreamingQueryManager.scala:269) 
        at org.apache.spark.sql.streaming.DataStreamWriter.start(DataStreamWriter.scala:262) 
        at com.roku.dea.spark.streaming.FactDeviceLogsProcessor$.main(FactDeviceLogsProcessor.scala:133) 
        at com.roku.dea.spark.streaming.FactDeviceLogsProcessor.main(FactDeviceLogsProcessor.scala) 
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) 
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) 
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) 
        at java.lang.reflect.Method.invoke(Method.java:498) 
        at org.apache.spark.deploy.yarn.ApplicationMaster$$anon$2.run(ApplicationMaster.scala:637) 
17/01/31 21:23:56 INFO SparkContext: Invoking stop() from shutdown hook 
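
The trace shows an `s3://` path being handed to `DistributedFileSystem`, whose `checkPath` call rejects it. As a rough illustration of that kind of scheme-and-authority check, here is a simplified Python sketch. It is not Hadoop's actual implementation, and the function name `check_path` is hypothetical.

```python
from urllib.parse import urlparse


def check_path(fs_uri: str, path: str) -> None:
    """Raise ValueError if `path` does not belong to the filesystem at `fs_uri`.

    Simplified assumption: a path matches when its scheme and authority
    (host:port or bucket) equal those of the filesystem URI.
    """
    fs, p = urlparse(fs_uri), urlparse(path)
    if not p.scheme:
        return  # scheme-less paths are resolved against this filesystem
    same_scheme = p.scheme.lower() == fs.scheme.lower()
    same_authority = (not p.netloc) or p.netloc.lower() == fs.netloc.lower()
    if not (same_scheme and same_authority):
        raise ValueError(f"Wrong FS: {path}, expected: {fs_uri}")


try:
    check_path("hdfs://xxxx:8020", "s3://xxxx/fact_checkpoints/metadata")
except ValueError as err:
    print(err)  # Wrong FS: s3://xxxx/fact_checkpoints/metadata, expected: hdfs://xxxx:8020
```

The point of the sketch is only that a filesystem object bound to one URI refuses paths that carry a different scheme.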

A couple of questions here:

  1. Why is S3 not supported as a checkpoint directory (regular Spark Streaming supports this)? What makes a filesystem "HDFS compliant"?
  2. I use HDFS ephemerally (since clusters can come up or go down at any time) and use S3 as the place to persist all data. What would you recommend for storing Structured Streaming checkpoint data in such a setup?

Solution

  • What makes an FS HDFS "compliant"? It is a file system with the behaviours specified in the Hadoop FS specification. The difference between an object store and a file system is covered there, with the key point being "eventually consistent object stores without append or O(1) atomic renames are not compliant".

    For S3 in particular:

    1. It's not consistent: after a new blob is created, a list command often doesn't show it. The same goes for deletions.
    2. When a blob is overwritten or deleted, it can take a while for the old version to go away.
    3. rename() is implemented as a copy followed by a delete.
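
To make the third point concrete, here is a minimal sketch of a flat key/value store in which a "directory rename" has to copy and then delete every blob under a prefix. `ToyObjectStore` is a hypothetical in-memory stand-in, not an S3 client, and the key names are made up for illustration.

```python
class ToyObjectStore:
    """In-memory stand-in for a flat blob store (hypothetical, not an S3 client)."""

    def __init__(self) -> None:
        self.blobs = {}        # key -> bytes
        self.bytes_copied = 0  # total bytes moved by rename()

    def put(self, key: str, data: bytes) -> None:
        self.blobs[key] = data

    def rename(self, src_prefix: str, dst_prefix: str) -> None:
        # There are no real directories, so every key under the prefix
        # is copied to its new name and the old key is then deleted.
        for key in [k for k in self.blobs if k.startswith(src_prefix)]:
            data = self.blobs[key]
            self.blobs[dst_prefix + key[len(src_prefix):]] = data  # copy
            self.bytes_copied += len(data)
            del self.blobs[key]                                    # delete


store = ToyObjectStore()
store.put("tmp/state/part-0", b"a" * 1000)
store.put("tmp/state/part-1", b"b" * 2000)
store.rename("tmp/state/", "checkpoint/state/")
print(sorted(store.blobs))  # ['checkpoint/state/part-0', 'checkpoint/state/part-1']
print(store.bytes_copied)   # 3000
```

The work done by `rename` grows with the amount of data under the prefix, whereas a file system rename is expected to be an O(1) metadata change.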

    Spark Streaming checkpoints by saving everything to a location and then renaming it into the checkpoint directory. Because a rename on S3 is a copy, the time to checkpoint is proportional to the time needed to copy the data within S3, which runs at about 6-10 MB/s.
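
As a back-of-the-envelope illustration of that proportionality, the sketch below divides a checkpoint size by the 6-10 MB/s copy rate quoted above. The 600 MB size is a hypothetical figure chosen for the arithmetic, not a measurement.

```python
def copy_seconds(size_mb: float, throughput_mb_per_s: float) -> float:
    """Time to copy `size_mb` megabytes at a constant throughput."""
    return size_mb / throughput_mb_per_s


size_mb = 600  # hypothetical checkpoint size
slow = copy_seconds(size_mb, 6)   # lower bound of the quoted copy rate
fast = copy_seconds(size_mb, 10)  # upper bound of the quoted copy rate
print(f"{size_mb} MB: between {fast:.0f} s and {slow:.0f} s per checkpoint")
# 600 MB: between 60 s and 100 s per checkpoint
```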

    The current streaming code isn't suited to S3.

    For now, do one of the following:

    • Checkpoint to HDFS and then copy the results over to S3.
    • Checkpoint to an EBS volume allocated and attached to your cluster.
    • Checkpoint to S3, but leave a long gap between checkpoints so that the time to checkpoint doesn't bring your streaming app down.
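
For the first option, one way to do the copy is Hadoop's `distcp` tool. The sketch below only builds the command line. The HDFS source path is an assumption (the namenode address from the error message plus the same directory name), and the helper name `distcp_command` is hypothetical.

```python
import shlex
from typing import List


def distcp_command(src: str, dst: str) -> List[str]:
    """Build a `hadoop distcp` invocation that syncs `src` into `dst`."""
    # -update copies only files that are missing or differ at the destination.
    return ["hadoop", "distcp", "-update", src, dst]


cmd = distcp_command(
    "hdfs://xxxx:8020/fact_checkpoints",  # assumed HDFS checkpoint directory
    "s3://xxxx/fact_checkpoints",         # bucket path from the question
)
print(shlex.join(cmd))
# hadoop distcp -update hdfs://xxxx:8020/fact_checkpoints s3://xxxx/fact_checkpoints
```

On a node with the Hadoop CLI installed, the list could be passed to `subprocess.run(cmd, check=True)`. Copying while the query is still writing may capture a checkpoint mid-update, so it is probably safest to copy after the query has stopped.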

    If you are using EMR, you can pay the premium for DynamoDB-backed S3, which gives you better consistency. But the copy time is still the same, so checkpointing will be just as slow.