text-processing, similarity, mahout

Calculating cosine similarity in Mahout


To find the similarity between two documents, I am planning to use Mahout.

The process would include:

  1. Converting each document to TF-IDF
  2. Removing stop words (to make the search more effective)
  3. Running cosine similarity
  4. Reporting the degree of similarity

I am planning to implement this in Mahout, but I am a beginner with it. Can somebody point me to a few tutorials for this, and tell me whether this is an effective way to calculate the similarity between documents?


Solution

  • You don't need to implement anything yourself. Use seqdirectory to convert your documents into SequenceFiles and seq2sparse to vectorize them as TF-IDF vectors. After that you can use RowSimilarityJob to compute the pairwise cosine similarities.
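
If it helps to see what such a pipeline computes, here is a minimal pure-Python sketch of the same four steps on a toy corpus. It is not Mahout code: the stop-word list, the smoothed IDF formula, and the function names (tokenize, tfidf_vectors, cosine_similarity) are assumptions made for illustration, and Mahout's own tokenization and weighting will likely differ in detail.

```python
import math
import re
from collections import Counter

# Tiny illustrative stop-word list (an assumption; real lists are much longer).
STOP_WORDS = {"a", "an", "and", "the", "is", "of", "on", "in", "to"}


def tokenize(text):
    """Lowercase the text, keep alphabetic tokens, and drop stop words."""
    return [t for t in re.findall(r"[a-z]+", text.lower()) if t not in STOP_WORDS]


def tfidf_vectors(docs):
    """Return one sparse {term: weight} dict per document."""
    token_lists = [tokenize(d) for d in docs]
    n_docs = len(token_lists)
    # Document frequency: in how many documents each term appears.
    df = Counter(term for tokens in token_lists for term in set(tokens))
    # Smoothed IDF (one common variant, chosen here for illustration).
    idf = {term: math.log((1 + n_docs) / (1 + count)) + 1.0
           for term, count in df.items()}
    return [
        {term: tf * idf[term] for term, tf in Counter(tokens).items()}
        for tokens in token_lists
    ]


def cosine_similarity(u, v):
    """Cosine of the angle between two sparse vectors; 0.0 if either is empty."""
    dot = sum(weight * v.get(term, 0.0) for term, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


if __name__ == "__main__":
    docs = [
        "The cat sat on the mat.",
        "A cat is sitting on a mat in the hall.",
        "Stock prices fell sharply on Monday.",
    ]
    vectors = tfidf_vectors(docs)
    # Print the similarity of every unordered pair of documents.
    for i in range(len(docs)):
        for j in range(i + 1, len(docs)):
            print(i, j, round(cosine_similarity(vectors[i], vectors[j]), 3))
```

Because TF-IDF weights are non-negative, the score falls between 0 (no shared terms) and 1 (same direction), so it can be reported directly as the degree of similarity. Comparing every pair this way is quadratic in the number of documents, which is the part a distributed job like RowSimilarityJob is meant to handle on larger collections.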