I have two corpora (Corpus 1 and Corpus 2). Documents in Corpus 1 contain sentences plagiarized from Corpus 2. I'm using a Tf-Idf approach to measure the similarity between documents in Corpus 1 and documents in Corpus 2.
An inverted index for terms in corpus 2 has been built, as follows:
In short, for each pair of sentences being compared, I build two Tf-Idf vectors and then measure their similarity using cosine similarity.
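To make the comparison step concrete, here is a minimal sketch of it, assuming sentences are already tokenized and vectors are stored as sparse dicts. The function names and the `idf` values are hypothetical and only for illustration; terms missing from the Idf table get weight 0, which is the behaviour described in this question.

```python
import math
from collections import Counter

def tfidf_vector(tokens, idf):
    """Map each term to tf * idf; terms absent from `idf` get weight 0."""
    counts = Counter(tokens)
    return {term: count * idf.get(term, 0.0) for term, count in counts.items()}

def cosine(u, v):
    """Cosine similarity of two sparse vectors stored as dicts."""
    dot = sum(weight * v.get(term, 0.0) for term, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # an all-zero vector is similar to nothing
    return dot / (norm_u * norm_v)

# Hypothetical Idf values, as if looked up from the Corpus 2 index.
idf = {"cat": 1.0, "sat": 2.0, "mat": 2.0}

a = tfidf_vector("cat sat mat".split(), idf)  # sentence from Corpus 2
b = tfidf_vector("cat sat rug".split(), idf)  # sentence from Corpus 1; "rug" is unseen
sim = cosine(a, b)
```

Note that the unseen term "rug" contributes nothing to the dot product or to the norm of `b`, so it is silently ignored in the comparison.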
My question concerns how I build the vectors for sentences from Corpus 1: I use the Corpus 2 index to get the Idf, by counting the documents that contain term X. Is this the right way? Some terms in Corpus 1 do not occur in Corpus 2, so the Tf-Idf function returns 0 for them. Or do I have to build another index for Corpus 1 (which, in my opinion, would undermine the power of Tf-Idf)?
We have to index the target corpus, i.e. the one we need to search in order to do the job. For example, if we have two corpora, an original one and a plagiarized one, we have to index the original one, since that is the one we search through.
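As a rough sketch of what indexing only the original corpus could look like, the snippet below builds an inverted index from Corpus 2 and derives the Idf from it. The add-one smoothing is my own assumption, not something stated in the question: it is one common way to keep terms that never occur in the indexed corpus from getting a weight of 0. The function names and the toy documents are hypothetical.

```python
import math
from collections import defaultdict

def build_inverted_index(docs):
    """Map each term to the set of ids of the documents that contain it."""
    index = defaultdict(set)
    for doc_id, text in enumerate(docs):
        for term in text.lower().split():
            index[term].add(doc_id)
    return index

def smoothed_idf(term, index, n_docs):
    """Idf taken from the index; add-one smoothing (an assumption) keeps
    terms that are missing from the index at a finite, non-zero weight."""
    df = len(index.get(term, ()))  # document frequency, 0 if unseen
    return math.log((1 + n_docs) / (1 + df)) + 1.0

# Toy stand-in for the original corpus (Corpus 2).
corpus2 = [
    "the cat sat on the mat",
    "the dog sat on the log",
    "a bird sang",
]
index = build_inverted_index(corpus2)

idf_common = smoothed_idf("the", index, len(corpus2))    # in 2 of 3 documents
idf_unseen = smoothed_idf("zebra", index, len(corpus2))  # not in Corpus 2 at all
```

With this choice, a term that appears only in the suspicious sentence still contributes to that sentence's vector norm, so it lowers the cosine similarity instead of being ignored.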