Tags: scala, apache-spark, sbt, rdd, sbt-assembly

sortBy is not a member of org.apache.spark.rdd.RDD


Hello~ I'm interested in Spark. I ran the code below in spark-shell.

val data = sc.parallelize(Array(Array(1,2,3), Array(2,3,4), Array(1,2,1)))
data: org.apache.spark.rdd.RDD[Array[Int]] = ParallelCollectionRDD[0] at parallelize at <console>:26

data.map(x => (x(0), 1)).reduceByKey((x,y) => x + y).sortBy(_._1).collect()
res9: Array[(Int, Int)] = Array((1,2), (2,1))

It works in the shell. But when I build the same code with sbt assembly, it doesn't compile.

The error message is:

[error] value sortBy is not a member of org.apache.spark.rdd.RDD[(Int, Int)]

[error] data.map(x => (x(0), 1)).reduceByKey((x,y) => x + y).sortBy(_._1) <= here is the problem

My build.sbt is:

import AssemblyKeys._

assemblySettings

name := "buc"

version := "0.1"

scalaVersion := "2.10.5"

libraryDependencies += "org.apache.spark" % "spark-mllib_2.10" % "1.0.0" % "provided"

Where is the problem?


Solution

  • The first problem is that you are using Spark 1.0.0, and if you read the documentation for that version you won't find any sortBy method in the RDD class (it was only added in Spark 1.1.0). So you should update from 1.0.x to 2.0.x.

    On the other hand, the spark-mllib dependency pulls in the Spark MLlib library, and that's not what you need here. You need the dependency for spark-core:

    libraryDependencies += "org.apache.spark" % "spark-core_2.10" % "2.0.0" % "provided"