Tags: scala, intellij-idea, apache-spark-mllib

How to run MLlib usage examples with Intellij IDE?


I'm an absolute beginner with Scala, and I want to try out Spark's machine learning library, MLlib, on some very simple examples.

I took the first example from the "Basic Statistics" section of the MLlib Main Guide and tried to reproduce it in an IntelliJ IDEA worksheet, initialised exactly as the Scala documentation describes and with all library dependencies imported correctly.

So here is the code:

import org.apache.spark.ml.linalg.{Matrix, Vectors}
import org.apache.spark.ml.stat.Correlation
import org.apache.spark.sql.Row

val data = Seq(
  Vectors.sparse(4, Seq((0, 1.0), (3, -2.0))),
  Vectors.dense(4.0, 5.0, 0.0, 3.0),
  Vectors.dense(6.0, 7.0, 0.0, 8.0),
  Vectors.sparse(4, Seq((0, 9.0), (3, 1.0)))
)

val df = data.map(Tuple1.apply).toDF("features")
val Row(coeff1: Matrix) = Correlation.corr(df, "features").head
println(s"Pearson correlation matrix:\n $coeff1")

val Row(coeff2: Matrix) = Correlation.corr(df, "features", "spearman").head
println(s"Spearman correlation matrix:\n $coeff2")

The problem arises with toDF: IntelliJ cannot resolve the symbol, so the DataFrame cannot be created. I would like to know exactly how to fix this. I tried to use

val sqlContext= new org.apache.spark.sql.SQLContext(sc)
import sqlContext.implicits._

as described in SPARK/SQL:spark can't resolve symbol toDF, but it did not work. The exact solution would be much appreciated so I can continue learning and try the next examples of the MLlib guide.


Solution

  • I just found out that I had to initialise the Spark Session properly. Here is the code to add:

    import org.apache.spark.sql.SparkSession
    
    // Create (or reuse) a local SparkSession
    val spark = SparkSession
      .builder()
      .appName("Basic statistics")
      .config("spark.master", "local")
      .getOrCreate()
    
    // Its implicits bring toDF (and other conversions) into scope
    import spark.implicits._
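
For completeness, here is a sketch of the whole worksheet with the fix applied: the SparkSession block from the solution, followed by the correlation example from the guide. This assumes the Spark SQL and MLlib dependencies are already on the project's classpath, as set up in the question.

```scala
import org.apache.spark.ml.linalg.{Matrix, Vectors}
import org.apache.spark.ml.stat.Correlation
import org.apache.spark.sql.{Row, SparkSession}

// Initialise the session first; spark.implicits._ is what makes toDF resolve
val spark = SparkSession
  .builder()
  .appName("Basic statistics")
  .config("spark.master", "local")
  .getOrCreate()
import spark.implicits._

val data = Seq(
  Vectors.sparse(4, Seq((0, 1.0), (3, -2.0))),
  Vectors.dense(4.0, 5.0, 0.0, 3.0),
  Vectors.dense(6.0, 7.0, 0.0, 8.0),
  Vectors.sparse(4, Seq((0, 9.0), (3, 1.0)))
)

// Wrap each vector in a Tuple1 so toDF can name the single column
val df = data.map(Tuple1.apply).toDF("features")

val Row(coeff1: Matrix) = Correlation.corr(df, "features").head
println(s"Pearson correlation matrix:\n $coeff1")

val Row(coeff2: Matrix) = Correlation.corr(df, "features", "spearman").head
println(s"Spearman correlation matrix:\n $coeff2")

spark.stop()  // release local resources when done
```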