Tags: nlp, wordnet, word2vec, word-embedding, plagiarism-detection

Will Word2Vec be more efficient in text-based plagiarism detection than WordNet, or than other word embeddings like GloVe or fastText?


I am new to Word2Vec and have just started studying it online. I have gone through almost all of the related questions on Quora and StackOverflow but could not find an answer in any of them. So my questions are:

  1. Is it possible to apply word2vec in plagiarism detection?
  2. If yes, will Word2Vec be more efficient in text-based plagiarism detection than WordNet, or than other word embeddings like GloVe or fastText?

Thanks in advance.


Solution

  • Yes, these "dense embedding" models of word meaning like word2vec may be useful in plagiarism detection. (They're also likely useful in obfuscating plagiarism from simple detectors, as they can assist automated transforms on existing text that change the words while keeping the meaning similar.)

    Only by testing within a particular system and with respect to quantitative evaluations will you know for sure how well it can work, or whether a particular embedding is better or worse than something like WordNet.

    Among word2vec, fastText, and GloVe, results will probably be very similar: they all use roughly the same information (word co-occurrences within a sliding context window) to make maximally-predictive word-vectors, so they behave very similarly given similar training data.

    Any differences are subtle. The non-GloVe options might work better for very large vocabularies. fastText is essentially word2vec in some modes, but adds options for either modeling subword n-grams (which can then help to create better-than-random vectors for future out-of-vocabulary words) or optimizing the vectors for classification problems.

    But the vectors for known words, which can be trained with plentiful training data, are going to be very similar in capabilities if the meta-parameters of each training process are similarly optimized for your task.
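To make the idea concrete, here is a minimal sketch of one common way word-vectors could feed a similarity check: average the vectors of the words in each text, then compare the averages by cosine similarity. This is only an illustration, not a complete plagiarism detector. The `TOY_VECTORS` table and the helper names (`mean_vector`, `cosine`, `similarity`) are hypothetical and invented for this example; in practice the vectors would come from a trained word2vec, fastText, or GloVe model.

```python
import math

# Hypothetical 3-dimensional vectors, invented for illustration only.
# Real vectors would come from a trained word2vec, fastText, or GloVe model
# and would typically have 100 or more dimensions.
TOY_VECTORS = {
    "the": [0.1, 0.1, 0.1],
    "student": [0.9, 0.1, 0.0],
    "pupil": [0.8, 0.2, 0.1],
    "copied": [0.1, 0.9, 0.0],
    "plagiarized": [0.2, 0.8, 0.1],
    "essay": [0.0, 0.2, 0.9],
    "paper": [0.1, 0.1, 0.8],
    "weather": [-0.7, -0.6, 0.1],
    "sunny": [-0.8, -0.5, 0.0],
    "today": [-0.5, -0.7, 0.2],
}


def mean_vector(tokens, vectors):
    """Average the vectors of all known tokens; return None if none are known."""
    known = [vectors[t] for t in tokens if t in vectors]
    if not known:
        return None
    dim = len(known[0])
    return [sum(v[i] for v in known) / len(known) for i in range(dim)]


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def similarity(text_a, text_b, vectors=TOY_VECTORS):
    """Compare two texts by the cosine of their averaged word-vectors.

    Out-of-vocabulary words are skipped; texts with no known words score 0.0.
    """
    vec_a = mean_vector(text_a.lower().split(), vectors)
    vec_b = mean_vector(text_b.lower().split(), vectors)
    if vec_a is None or vec_b is None:
        return 0.0
    return cosine(vec_a, vec_b)


original = "the student copied the essay"
paraphrase = "the pupil plagiarized the paper"
unrelated = "the weather was sunny today"

print(similarity(original, paraphrase))  # high: different words, similar meaning
print(similarity(original, unrelated))   # low: unrelated topic
```

With these toy vectors the paraphrase scores far higher than the unrelated sentence even though it shares almost no words with the original, which is the property that simple exact-match detectors lack. Any cutoff for flagging a pair as suspicious would have to be chosen by evaluating against labeled examples in your own system, and the same harness could be rerun with different embeddings to compare them.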