To find the similarity between two documents, I am planning to use Mahout.
The process would include:
I am planning to implement this in Mahout. I am a beginner with Mahout. Can somebody point me to a few tutorials for this, and tell me whether this is an effective way to calculate the similarity between documents?
You don't need to implement anything. Use seqdirectory and seq2sparse to vectorize your data. After that you can use RowSimilarityJob to compute pairwise cosine similarities.
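To get a feel for what that pipeline computes, here is a minimal pure-Python sketch of the same idea: turn each document into a TF-IDF vector, then take the cosine of every pair. This is not Mahout code and it does not call any Mahout API. The function names (`tfidf_vectors`, `cosine`, `pairwise_similarities`), the smoothed IDF formula, and the sample documents are all assumptions made for illustration, and Mahout's own weighting may differ in detail.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Turn tokenized documents into sparse TF-IDF dicts (term -> weight).

    Hypothetical helper for illustration; the smoothed IDF used here is an
    assumption and is not guaranteed to match Mahout's weighting.
    """
    n = len(docs)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        tf = Counter(doc)  # raw term counts within this document
        vectors.append(
            {t: c * (math.log((1 + n) / (1 + df[t])) + 1.0) for t, c in tf.items()}
        )
    return vectors


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # an empty document is treated as similar to nothing
    return dot / (norm_u * norm_v)


def pairwise_similarities(docs):
    """Return {(i, j): similarity} for every pair of documents with i < j."""
    vecs = tfidf_vectors(docs)
    return {
        (i, j): cosine(vecs[i], vecs[j])
        for i in range(len(vecs))
        for j in range(i + 1, len(vecs))
    }


# Made-up sample documents, tokenized by whitespace.
docs = [
    "the cat sat on the mat".split(),
    "the cat sat on the sofa".split(),
    "stock markets fell sharply today".split(),
]
sims = pairwise_similarities(docs)
```

With these sample documents, the first two share most of their terms and so score higher with each other than either does with the third, which shares no terms with them. The all-pairs loop is quadratic in the number of documents, which is fine for a sketch but is the part a distributed job is meant to scale.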