
What are the differences between the vector implementations of Spark MLlib and Spark ML?


At a high level I'm aware that Spark MLlib is written on top of RDDs and Spark ML is built on top of DataFrames, but my understanding is lacking in detail.

In particular, the lack of compatibility between the different vector implementations made me wonder what the differences in implementation are, and why those design decisions were made.


Solution

  • The motivation for keeping local linear algebra in ml has been explained in SPARK-13944.

    Separate out linear algebra as a standalone module without Spark dependency to simplify production deployment. We can call the new module mllib-local, which might contain local models in the future. The major issue is to remove dependencies on user-defined types.

    The package name will be changed from mllib to ml. For example, Vector will be changed from org.apache.spark.mllib.linalg.Vector to org.apache.spark.ml.linalg.Vector. The return vector type in the new ML pipeline will be the one in the ml package; however, the existing mllib code will not be touched. As a result, this will potentially break the API. Also, when a vector is loaded from an mllib vector by Spark SQL, it will automatically be converted into the one in the ml package.
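As a sketch of what that conversion looks like in practice, assuming Spark 2.x, the mllib side ships MLUtils helpers for converting vector columns of a DataFrame to the new ml types (the column name "features" and the local SparkSession setup here are illustrative):

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.mllib.linalg.{Vectors => OldVectors}
import org.apache.spark.mllib.util.MLUtils

val spark = SparkSession.builder().master("local[*]").appName("vec-conv").getOrCreate()
import spark.implicits._

// A DataFrame whose "features" column holds old-style mllib vectors.
val df = Seq(
  (0, OldVectors.dense(1.0, 2.0)),
  (1, OldVectors.sparse(2, Array(0), Array(3.0)))
).toDF("id", "features")

// Convert the vector column so that ml-package Transformers and
// Estimators can consume it; the schema now uses the VectorUDT
// from org.apache.spark.ml.linalg instead of the mllib one.
val converted = MLUtils.convertVectorColumnsToML(df, "features")
converted.printSchema()
```

MLUtils.convertVectorColumnsFromML performs the reverse conversion for code that still expects the mllib types.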

    Right now the implementations are close to identical, excluding some conversion methods.
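For local vectors, those conversion methods are, assuming Spark 2.x, asML on the mllib types and Vectors.fromML on the mllib companion object; a minimal sketch:

```scala
import org.apache.spark.mllib.linalg.{Vector => OldVector, Vectors => OldVectors}
import org.apache.spark.ml.linalg.{Vector => NewVector}

// An old-style mllib vector...
val old: OldVector = OldVectors.dense(1.0, 2.0, 3.0)

// ...converted to the new ml.linalg representation (available since Spark 2.0).
val converted: NewVector = old.asML

// And back, via the companion object on the mllib side.
val roundTripped: OldVector = OldVectors.fromML(converted)

// The values are preserved; only the package (and hence the UDT) differs.
assert(roundTripped == old)
```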