scala rdd sequencefile

Scala: cannot save an RDD as a sequence file, although the docs say it is allowed


I am using Spark 1.6. According to the official documentation, an RDD can be saved in sequence file format, but for my RDD textFile I get:

scala> textFile.saveAsSequenceFile("products_sequence")
<console>:30: error: value saveAsSequenceFile is not a member of org.apache.spark.rdd.RDD[String]

I searched and found similar discussions that seem to suggest this works in PySpark. Am I misreading the official documentation? Can saveAsSequenceFile() be used in Scala?


Solution

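  • saveAsSequenceFile is only available when the RDD contains key-value pairs. It is not a method of RDD itself: it is added through an implicit conversion that applies only to an RDD[(K, V)], in the same way as the methods of PairRDDFunctions (in the Scala API the method itself lives in SequenceFileRDDFunctions).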

    https://spark.apache.org/docs/2.1.1/api/scala/index.html#org.apache.spark.rdd.PairRDDFunctions

    You can see that the API definition takes a K and a V.

    If you change your code above to:

    import org.apache.spark.SparkContext
    import org.apache.spark.SparkConf
    import org.apache.spark.rdd._

    object SequenceFile extends App {
       // Run locally with a single worker thread
       val conf = new SparkConf().setAppName("sequenceFile").setMaster("local[1]")
       val sc = new SparkContext(conf)
       // An RDD of (key, value) pairs, which is what saveAsSequenceFile requires
       val rdd : RDD[(String, String)] = sc.parallelize(List(("foo", "foo1"), ("bar", "bar1"), ("baz", "baz1")))
       // Writes an output directory named foo.seq; fails if it already exists
       rdd.saveAsSequenceFile("foo.seq")
       sc.stop()
    }

    This works, and you get a foo.seq output directory containing the part files. It works because the RDD holds key-value pairs rather than being just an RDD[String].
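
For the RDD[String] from the question, one possible approach is to give each line a key before saving. The sketch below is only an illustration under some assumptions: the object name SaveTextAsSequenceFile and the helper toPairs are hypothetical, the sample lines stand in for the real textFile, and the line index is used as the key simply because the question does not say what a meaningful key would be.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.RDD

object SaveTextAsSequenceFile {
  // Hypothetical helper: turns RDD[String] into RDD[(Long, String)]
  // by using each line's index as its key.
  def toPairs(lines: RDD[String]): RDD[(Long, String)] =
    lines.zipWithIndex().map { case (line, idx) => (idx, line) }

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("saveTextAsSequenceFile").setMaster("local[1]")
    val sc = new SparkContext(conf)
    try {
      // Stand-in for the textFile RDD from the question (assumed sample data)
      val textFile: RDD[String] = sc.parallelize(Seq("apple", "banana", "cherry"))

      // The pair RDD picks up saveAsSequenceFile through the implicit conversion.
      // The output directory must not already exist.
      toPairs(textFile).saveAsSequenceFile("products_sequence")

      // Read the data back as (Long, String) pairs to check the round trip
      val restored: RDD[(Long, String)] = sc.sequenceFile[Long, String]("products_sequence")
      restored.collect().foreach(println)
    } finally {
      sc.stop()
    }
  }
}
```

If the lines already contain a natural key (for example a product ID in the first column), mapping each line to (key, line) instead of using zipWithIndex would probably be the better choice.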