I'm an absolute beginner with Scala, and I want to try out Spark MLlib's machine-learning library on some very simple examples.
I took the first example from the "Basic Statistics" page of the MLlib Main Guide and tried to reproduce it in an IntelliJ IDEA worksheet, initialised exactly as the Scala documentation describes, with all library dependencies imported correctly.
So here is the code:
import org.apache.spark.ml.linalg.{Matrix, Vectors}
import org.apache.spark.ml.stat.Correlation
import org.apache.spark.sql.Row
val data = Seq(
  Vectors.sparse(4, Seq((0, 1.0), (3, -2.0))),
  Vectors.dense(4.0, 5.0, 0.0, 3.0),
  Vectors.dense(6.0, 7.0, 0.0, 8.0),
  Vectors.sparse(4, Seq((0, 9.0), (3, 1.0)))
)
val df = data.map(Tuple1.apply).toDF("features")
val Row(coeff1: Matrix) = Correlation.corr(df, "features").head
println(s"Pearson correlation matrix:\n $coeff1")
val Row(coeff2: Matrix) = Correlation.corr(df, "features", "spearman").head
println(s"Spearman correlation matrix:\n $coeff2")
The problem arises with toDF: IntelliJ cannot resolve the symbol, so the DataFrame cannot be created. I would like to know exactly how to fix this. I tried
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
import sqlContext.implicits._
as described in SPARK/SQL: spark can't resolve symbol toDF, but it did not work. The exact solution would be much appreciated so I can continue with the next examples in the MLlib guide.
I just found out that I had to properly initialise the Spark session first. Here is the code to add:
import org.apache.spark.sql.SparkSession
val spark = SparkSession
  .builder()
  .appName("Basic statistics")
  .config("spark.master", "local")
  .getOrCreate()
import spark.implicits._
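For completeness, here is the whole example put together as a small standalone program, with the session initialisation in place. This is just a sketch: it assumes the spark-sql and spark-mllib artifacts are on the classpath, and the object name, the explicit main method, and the final spark.stop() call are my additions (in a worksheet you can keep the statements top-level instead). The key point is that import spark.implicits._ must come from the SparkSession instance and appear before the call to toDF.

```scala
import org.apache.spark.ml.linalg.{Matrix, Vectors}
import org.apache.spark.ml.stat.Correlation
import org.apache.spark.sql.{Row, SparkSession}

object BasicStatistics {
  def main(args: Array[String]): Unit = {
    // Build a local session; toDF becomes available only after
    // importing spark.implicits._ from this session instance.
    val spark = SparkSession
      .builder()
      .appName("Basic statistics")
      .config("spark.master", "local")
      .getOrCreate()
    import spark.implicits._

    val data = Seq(
      Vectors.sparse(4, Seq((0, 1.0), (3, -2.0))),
      Vectors.dense(4.0, 5.0, 0.0, 3.0),
      Vectors.dense(6.0, 7.0, 0.0, 8.0),
      Vectors.sparse(4, Seq((0, 9.0), (3, 1.0)))
    )

    // Wrap each vector in a Tuple1 so toDF can name the single column.
    val df = data.map(Tuple1.apply).toDF("features")

    val Row(coeff1: Matrix) = Correlation.corr(df, "features").head
    println(s"Pearson correlation matrix:\n $coeff1")

    val Row(coeff2: Matrix) = Correlation.corr(df, "features", "spearman").head
    println(s"Spearman correlation matrix:\n $coeff2")

    spark.stop() // release local resources when done
  }
}
```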