Search code examples
scalaapache-sparkapache-spark-mllibtf-idfcosine-similarity

Cosine similarity on TFIDF using apache spark


I'm trying to compute cosine similarity matrix on TFIDF, using Apache Spark. Here is my code:

def cosSim(input: RDD[Seq[String]]) = {
  val hashingTF = new HashingTF()
  val tf = hashingTF.transform(input)
  tf.cache()
  val idf = new IDF().fit(tf)
  val tfidf = idf.transform(tf)
  val mat = new RowMatrix(tfidf)
  val sim = mat.columnSimilarities
  sim
}

I have around 3000 lines in the input, but if I do sim.numRows() or sim.numCols() I will see 1048576 instead of 3K, as I understand, it's because val tfidf and therefore val mat both has size 3K * 1048576 where 1048576 it's number of tf features. Maybe to solve the problem I have to transpose mat, but I don't know how to do it.


Solution

  • You can try:

    import org.apache.spark.mllib.linalg.distributed._
    
    val irm = new IndexedRowMatrix(rowMatrix.rows.zipWithIndex.map {
       case (v, i) => IndexedRow(i, v)
    })
    
    irm.toCoordinateMatrix.transpose.toRowMatrix.columnSimilarities