I'm trying to compute a cosine similarity matrix on TF-IDF vectors using Apache Spark. Here is my code:
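import org.apache.spark.mllib.feature.{HashingTF, IDF}
import org.apache.spark.mllib.linalg.distributed.RowMatrix
import org.apache.spark.rdd.RDD

def cosSim(input: RDD[Seq[String]]) = {
  val hashingTF = new HashingTF()
  val tf = hashingTF.transform(input)   // one term-frequency vector per input line
  tf.cache()
  val idf = new IDF().fit(tf)
  val tfidf = idf.transform(tf)
  val mat = new RowMatrix(tfidf)        // rows = documents, columns = hashed features
  val sim = mat.columnSimilarities
  sim
}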
I have around 3000 lines in the input, but sim.numRows() and sim.numCols() both return 1048576 instead of 3000. As I understand it, this is because tfidf, and therefore mat, has size 3000 x 1048576, where 1048576 is the number of TF features. Maybe I have to transpose mat to solve the problem, but I don't know how to do it.
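columnSimilarities compares the columns of the matrix, so with documents as rows you get a feature-by-feature result. RowMatrix has no transpose method, but you can index the rows, convert to a CoordinateMatrix, transpose that, and convert back. You can try: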
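import org.apache.spark.mllib.linalg.distributed._

// Attach a row index to each document vector (mat is the RowMatrix from the question).
val irm = new IndexedRowMatrix(mat.rows.zipWithIndex.map {
  case (v, i) => IndexedRow(i, v)
})
// Transpose so that documents become columns, then compare columns.
irm.toCoordinateMatrix.transpose.toRowMatrix.columnSimilarities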
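To see why the transpose changes the result from features x features to documents x documents, here is a minimal sketch in plain Scala, without Spark. The object ColumnSimilarityDemo, its helper functions, and the 2 x 4 toy matrix are all made up for illustration; they are not part of MLlib. Note also that this toy version returns the full dense matrix, whereas Spark's columnSimilarities returns only the upper-triangular entries as a CoordinateMatrix.

```scala
// Toy illustration in plain Scala (no Spark): cosine similarity between the
// COLUMNS of a small dense matrix, before and after transposing it.
// All names here are hypothetical and only mimic the idea behind
// RowMatrix.columnSimilarities.
object ColumnSimilarityDemo {
  type Matrix = Vector[Vector[Double]]

  // Cosine similarity of two equal-length vectors; 0.0 if either is all zeros.
  def cosine(a: Vector[Double], b: Vector[Double]): Double = {
    val dot = a.zip(b).map { case (x, y) => x * y }.sum
    val norms = math.sqrt(a.map(x => x * x).sum) * math.sqrt(b.map(x => x * x).sum)
    if (norms == 0.0) 0.0 else dot / norms
  }

  // Pairwise cosine similarity between columns: the result is numCols x numCols.
  def columnSimilarities(m: Matrix): Matrix = {
    val cols = m.transpose
    cols.map(ci => cols.map(cj => cosine(ci, cj)))
  }

  def main(args: Array[String]): Unit = {
    // 2 documents (rows) x 4 hashed features (columns); values are invented.
    val docs: Matrix = Vector(
      Vector(1.0, 0.0, 2.0, 0.0),
      Vector(0.0, 3.0, 2.0, 0.0)
    )
    val byFeature = columnSimilarities(docs)            // features x features
    val byDocument = columnSimilarities(docs.transpose) // documents x documents
    println(s"without transpose: ${byFeature.length} x ${byFeature.head.length}")
    println(s"with transpose:    ${byDocument.length} x ${byDocument.head.length}")
    println(f"sim(doc0, doc1) = ${byDocument(0)(1)}%.4f")
  }
}
```

The same thing happens at scale in the question: with 3000 rows and 1048576 columns, comparing columns gives a 1048576 x 1048576 result, while comparing the columns of the transposed matrix gives the 3000 x 3000 document similarities.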