Search code examples
apache-sparkmatrix-inversescala-breeze

Can Spark and the ScalaNLP library Breeze be used together?


I'm developing a Scala-based extreme learning machine, in Apache Spark. My model has to be a Spark Estimator and use the Spark framework in order to fit into the machine learning pipeline. Does anyone know if Breeze can be used in tandem with Spark? All of my data is in Spark data frames and conceivably I could import it using Breeze, use Breeze DenseVectors as the data structure then convert to a DataFrame for the Estimator part. The advantage of Breeze is that it has a function pinv for the Moore-Penrose pseudo-inverse, which is an inverse for a non-square matrix. There is no equivalent function in the Spark MLlib, as far as I can see. I have no idea whether it's possible to convert Breeze tensors to Spark DataFrames so if anyone has experience of this it would be really useful. Thanks!


Solution

    • Breeze can be used with Spark. In fact is used internally for many MLLib functions, but required conversions are not exposed as public. You can add your own conversions and use Breeze to process individual records.

      For example for Vectors you can find conversion code:

      For Matrices please see asBreeze / fromBreeze in Matrices.scala

    • It cannot however, be used on distributed data structures. Breeze objects use low level libraries, which cannot be used for distributed processing. Therefore DataFrame - Breeze objects conversions are possible only if you collect data to the driver and are limited to the scenarios where data can be stored in the driver memory.

    • There exist other libraries, like SysteML, which integrate with Spark and provide more comprehensive linear algebra routines on distributed objects.