java tf-idf cosine-similarity inverted-index

Tf-Idf calculation for two corpuses


I have two corpuses (Corpus 1 and Corpus 2). Documents in Corpus 1 contain sentences plagiarized from Corpus 2. I'm using a TF-IDF approach to measure the similarity between the documents in Corpus 1 and the documents in Corpus 2.

An inverted index for the terms in Corpus 2 has been built (figure: Corpus 2 Index).

In short, for each pair of sentences being compared, I build two TF-IDF vectors and then measure their similarity using cosine similarity.
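The comparison step can be sketched roughly as follows. This is only an illustration, not the code from the question: it assumes each TF-IDF vector is stored as a sparse `Map<String, Double>` from term to weight, and the class and method names (`CosineSketch`, `cosine`, `norm`) are hypothetical.

```java
import java.util.Map;

final class CosineSketch {

    // Cosine similarity between two sparse vectors keyed by term.
    // Terms missing from one vector contribute 0 to the dot product.
    static double cosine(Map<String, Double> a, Map<String, Double> b) {
        double dot = 0.0;
        for (Map.Entry<String, Double> e : a.entrySet()) {
            Double other = b.get(e.getKey());
            if (other != null) {
                dot += e.getValue() * other;
            }
        }
        double normA = norm(a);
        double normB = norm(b);
        // Assumption: an all-zero vector is treated as similarity 0.
        if (normA == 0.0 || normB == 0.0) {
            return 0.0;
        }
        return dot / (normA * normB);
    }

    // Euclidean length of a sparse vector.
    static double norm(Map<String, Double> v) {
        double sumOfSquares = 0.0;
        for (double w : v.values()) {
            sumOfSquares += w * w;
        }
        return Math.sqrt(sumOfSquares);
    }

    public static void main(String[] args) {
        Map<String, Double> s1 = Map.of("plagiarism", 1.4, "detection", 1.0);
        Map<String, Double> s2 = Map.of("plagiarism", 1.4, "index", 2.1);
        System.out.println(cosine(s1, s2));
    }
}
```

Note that a term present in only one of the two vectors never adds to the dot product, but it still increases that vector's norm and therefore lowers the score.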

My question: when building the vectors for the sentences of Corpus 1, I used the Corpus 2 index to get the IDF, by counting the documents that contain a given term X. Is this the right way? Some terms in Corpus 1 do not appear in Corpus 2, and the TF-IDF function returns 0 for those terms. Or do I have to build another index for Corpus 1 (which, in my opinion, would undermine the power of TF-IDF)?


Solution

  • We have to index the target corpus, that is, the one we need to search through. For example, if we have two corpuses, an original one and a plagiarized one, we index the original one, since that is the corpus we search.
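A minimal sketch of this idea, under stated assumptions: document frequencies are collected from the original corpus only, and sentences from the plagiarized corpus are weighted with IDF values taken from that index. The class `ReferenceIdfSketch`, its methods, and the simple tokenizer are hypothetical. The add-one smoothing in `idf` is a choice made here, not part of the answer above; it gives terms that are absent from the original corpus a finite, non-zero weight instead of 0.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;

final class ReferenceIdfSketch {

    // term -> number of original-corpus documents containing it
    private final Map<String, Integer> docFreq = new HashMap<>();
    private final int numDocs;

    // Index only the original (searched) corpus.
    ReferenceIdfSketch(List<String> originalDocs) {
        numDocs = originalDocs.size();
        for (String doc : originalDocs) {
            // A set, so each document counts at most once per term.
            for (String term : new HashSet<>(tokenize(doc))) {
                docFreq.merge(term, 1, Integer::sum);
            }
        }
    }

    // Assumption: lowercase and split on non-word characters.
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        for (String t : text.toLowerCase().split("\\W+")) {
            if (!t.isEmpty()) {
                tokens.add(t);
            }
        }
        return tokens;
    }

    // Smoothed IDF (assumed variant): unseen terms have df = 0
    // and still receive a finite weight.
    double idf(String term) {
        int df = docFreq.getOrDefault(term, 0);
        return Math.log((numDocs + 1.0) / (df + 1.0)) + 1.0;
    }

    // TF-IDF vector for any sentence, using the original corpus's IDF.
    Map<String, Double> tfIdf(String sentence) {
        Map<String, Double> vector = new HashMap<>();
        for (String term : tokenize(sentence)) {
            vector.merge(term, 1.0, Double::sum); // raw term frequency
        }
        vector.replaceAll((term, tf) -> tf * idf(term));
        return vector;
    }

    public static void main(String[] args) {
        ReferenceIdfSketch index = new ReferenceIdfSketch(
                List.of("the cat sat", "the dog ran"));
        // "zebra" does not occur in the original corpus.
        System.out.println(index.tfIdf("the zebra sat"));
    }
}
```

Whether unseen terms should get a smoothed weight or simply be dropped is a design decision. Such terms can never match anything in the original corpus, so they do not add to the dot product; keeping them only lengthens the query vector and lowers the cosine score.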