scala, apache-spark

Scala 2.12.10 with Spark 3.0.0: What does "data.map(Tuple1.apply)" do?


I'm following an example of PCA analysis in Spark 3.0.0, using Scala 2.12.10. I'm quite new to programming in Scala and I'm having trouble understanding some of its nuances.

After defining the data as follows:

import org.apache.spark.ml.linalg.Vectors

val data = Array(
  Vectors.sparse(5, Seq((1, 1.0), (3, 7.0))),
  Vectors.dense(2.0, 0.0, 3.0, 4.0, 5.0),
  Vectors.dense(4.0, 0.0, 0.0, 6.0, 7.0)
)

the DataFrame is created like this:

val df = spark.createDataFrame(data.map(Tuple1.apply)).toDF("features")

My question is: what does data.map(Tuple1.apply) do? I think what bugs me is that apply is passed without any arguments.

Thank you in advance! Perhaps someone can also recommend a good beginner Scala / Spark book so my questions can be better ones in the future?


Solution

  • It wraps each Vector in a tuple of one element (Tuple1), turning the array into an Array[Tuple1[Vector]]. createDataFrame needs rows that are Products (tuples or case classes) to infer a schema, and toDF("features") then names that single column, so you get a DataFrame with one column of type vector. That's all, but very handy. apply appears without arguments because map expects a function: a method name used where a function value is expected is automatically converted into one (eta-expansion), so data.map(Tuple1.apply) is just shorthand for data.map(v => Tuple1(v)).
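
    To make this concrete, here is a minimal sketch (assuming a SparkSession named spark, as in spark-shell, and the org.apache.spark.ml.linalg imports from the question) showing that the two forms build the same value:

    import org.apache.spark.ml.linalg.{Vector, Vectors}

    val data = Array(
      Vectors.sparse(5, Seq((1, 1.0), (3, 7.0))),
      Vectors.dense(2.0, 0.0, 3.0, 4.0, 5.0),
      Vectors.dense(4.0, 0.0, 0.0, 6.0, 7.0)
    )

    // Tuple1.apply is the factory method on the Tuple1 companion object;
    // map hands each Vector to it, so both lines produce the same result.
    val wrapped: Array[Tuple1[Vector]] = data.map(Tuple1.apply)
    val wrappedExplicit: Array[Tuple1[Vector]] = data.map(v => Tuple1(v))

    // Each element is now a Product with exactly one field, which is what
    // createDataFrame needs to infer a single-column schema.
    val df = spark.createDataFrame(wrapped).toDF("features")
    df.printSchema()  // root |-- features: vector (nullable = true)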

    Some references: https://mungingdata.com/apache-spark/best-books/. I found the Databricks courses too simple, omitting relevant aspects. Some good sites exist: https://sparkbyexamples.com/ and https://www.waitingforcode.com/; the latter offers a good course at little cost.

    On Scala's apply there is also an excellent answer on Stack Overflow: "What is the apply function in Scala?"
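
    As a tiny, hypothetical illustration of apply (the Point class below is not from the question): defining apply on a companion object lets you call the object like a function, which is why Tuple1(v) and Tuple1.apply(v) are the same call, and why a bare Tuple1.apply can be handed to map:

    // Hypothetical class, purely to illustrate apply on a companion object.
    class Point(val x: Double, val y: Double)

    object Point {
      def apply(x: Double, y: Double): Point = new Point(x, y)
    }

    val p1 = Point(1.0, 2.0)        // sugar for Point.apply(1.0, 2.0)
    val p2 = Point.apply(1.0, 2.0)  // the same call, written out

    // A method name used where a function value is expected is converted
    // into one (eta-expansion), which is why no arguments appear here:
    val wrap: Double => Tuple1[Double] = Tuple1.apply
    wrap(3.0)  // Tuple1(3.0)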