This question is about MLlib (Spark 1.2.1+).
What is the best way to manipulate local matrices (moderate size, under 100x100, so they do not need to be distributed)?
For instance, after computing the SVD of a dataset, I need to perform some matrix operations.
The RowMatrix (Javadoc: https://spark.apache.org/docs/latest/api/java/org/apache/spark/mllib/linalg/distributed/RowMatrix.html) only provides a multiply function. The toBreeze method returns a DenseMatrix<Object>, but the API does not seem Java-friendly:
public final <TT,B,That> That $plus(B b, UFunc.UImpl2<OpAdd$,TT,B,That> op)
In Spark + Java, how can I do either of the following operations?
RDD<Vector> data = ...;
RowMatrix matrix = new RowMatrix(data);
// compute the top 15 singular values/vectors (k = 15, computeU = true, rCond = 1e-9)
SingularValueDecomposition<RowMatrix, Matrix> svd = matrix.computeSVD(15, true, 1e-9d);
RowMatrix U = svd.U();
Vector s = svd.s();
Matrix V = svd.V();
//Example 1: How to compute transpose(U)*matrix
//Example 2: How to compute transpose(U(:,1:k))*matrix
EDIT: Thanks to dlwh for pointing me in the right direction; the following solution works:
import no.uib.cipr.matrix.DenseMatrix;
// ...
RowMatrix U = svd.U();
DenseMatrix U_mtj = new DenseMatrix((int) U.numCols(), (int) U.numRows(), U.toBreeze().toArray$mcD$sp(), true);
// From there, matrix operations are available on U_mtj
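For reference, a minimal sketch of Example 1 (transpose(U)*matrix) on top of this conversion, using MTJ's transAmult (which computes C = A^T * B). It assumes the original data matrix was also converted to a local MTJ DenseMatrix, here called A_mtj (n x m), and that U_mtj holds U as an n x k matrix; those names and shapes are assumptions, not part of the code above.
// Assumption: U_mtj is an n x k MTJ DenseMatrix holding U, and A_mtj is an
// n x m MTJ DenseMatrix holding the original data matrix, both built as above.
int k = U_mtj.numColumns();
int m = A_mtj.numColumns();
// Example 1: transpose(U) * matrix -> k x m result, via MTJ's transAmult (C = A^T * B)
DenseMatrix UtA = new DenseMatrix(k, m);
U_mtj.transAmult(A_mtj, UtA);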
Breeze just doesn't provide a Java-friendly API. (And, speaking as the main author, I have no plans to: it would hamstring the API too much.)
You can probably exploit the fact that MTJ uses the same dense matrix representation as we do. (Well, almost. Their API doesn't expose majorStride, but that shouldn't be an issue for you.)
That is, you can do something like this:
import no.uib.cipr.matrix.DenseMatrix;
// ...
// toBreeze() returns a breeze.linalg.DenseMatrix[Double], which is seen as DenseMatrix<Object> from Java
breeze.linalg.DenseMatrix<Object> Ubreeze = U.toBreeze();
// hand the backing data array to MTJ's DenseMatrix constructor (deep copy)
new DenseMatrix(Ubreeze.cols(), Ubreeze.rows(), (double[]) Ubreeze.data(), true);
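For the second example from the question (transpose(U(:,1:k))*matrix), one option is to slice out the first k columns before multiplying. Since MTJ stores matrices column-major, those columns are simply the first n*k entries of the backing array. The sketch below is only that, a sketch: it assumes U_mtj (n x p) and A_mtj (n x m) are local MTJ DenseMatrix copies of U and of the data matrix, and the names and the value of k are placeholders.
import java.util.Arrays;
// ...
// Assumption: U_mtj (n x p) and A_mtj (n x m) are local MTJ copies of U and of the data matrix.
int n = U_mtj.numRows();
int k = 10; // placeholder: keep the first k columns of U
// column-major storage: the first k columns occupy the first n*k entries of the data array
double[] firstKCols = Arrays.copyOf(U_mtj.getData(), n * k);
DenseMatrix Uk = new DenseMatrix(n, k, firstKCols, false);
// transpose(U(:,1:k)) * matrix -> k x m
DenseMatrix UktA = new DenseMatrix(k, A_mtj.numColumns());
Uk.transAmult(A_mtj, UktA);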