Tags: java, algorithm, text-similarity

Text similarity algorithms


I'm doing a Java project where I have to build a text similarity program. I want it to take two text documents, compare them with each other, and report how similar they are to each other.

Later on I'll plug in an existing database that can find synonyms for words and go through the text to see whether one of the document writers just swapped words for synonyms while the text is otherwise exactly the same. The same goes for moving paragraphs up or down. Yes, as if it were a plagiarism program...

I'd like to hear from you what kind of algorithms you would recommend.

I've found Levenshtein and cosine similarity by looking here and in other places; both of them seem to be mentioned a lot. Hamming distance is another one my teacher told me about.

I have some questions about them, since I'm not really getting the Wikipedia articles. Could someone explain these things to me?

Levenshtein: This algorithm turns one word into another through substitutions, additions, and deletions, and counts how many steps that takes. But how can that be used on a whole text file? I can see how it works on a single word, but not on a sentence or an entire document compared against another.

Cosine: It's a measure of similarity between two vectors, obtained by measuring the cosine of the angle between them. What I don't understand is how two texts become two vectors, and what happens to the words/sentences in them.

Hamming: This distance seems to work better than Levenshtein, but it's only defined on strings of equal length. How is it relevant when two documents, and even the sentences in them, aren't strings of equal length?

Wikipedia should make sense, but it doesn't to me. I'm sorry if the questions sound too stupid, but they're holding me back, and I think there are people here quite capable of explaining this so that even newcomers to the field can get it.

Thanks for your time.


Solution

Levenshtein: in theory you could use it on a whole text file, but it's really not very suitable for the task. It's intended for single words or (at most) a short phrase.
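To make the single-word case concrete, here's a minimal sketch of the standard dynamic-programming Levenshtein distance (the class and method names are my own, not from any particular library):

```java
// Minimal sketch of the classic dynamic-programming Levenshtein distance.
public class Levenshtein {
    public static int distance(String a, String b) {
        int[][] d = new int[a.length() + 1][b.length() + 1];
        for (int i = 0; i <= a.length(); i++) d[i][0] = i; // i deletions
        for (int j = 0; j <= b.length(); j++) d[0][j] = j; // j insertions
        for (int i = 1; i <= a.length(); i++) {
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                d[i][j] = Math.min(
                        Math.min(d[i - 1][j] + 1,      // deletion
                                 d[i][j - 1] + 1),     // insertion
                        d[i - 1][j - 1] + cost);       // substitution
            }
        }
        return d[a.length()][b.length()];
    }
}
```

Running it on whole documents is possible in principle (treat each character, or each word, as a symbol), but the O(n·m) table is exactly why it doesn't scale well past short phrases.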

Cosine: You start by simply counting the unique words in each document. The answers to a previous question cover the computation once you've done that.
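As a sketch of what those vectors look like in Java: each document becomes a map from word to count, and the two maps are treated as vectors in the same word space (the tokenization below is deliberately naive):

```java
import java.util.HashMap;
import java.util.Map;

public class CosineSimilarity {
    // Each document's "vector": unique word -> number of occurrences.
    static Map<String, Integer> wordCounts(String text) {
        Map<String, Integer> counts = new HashMap<>();
        for (String word : text.toLowerCase().split("\\W+")) {
            if (!word.isEmpty()) counts.merge(word, 1, Integer::sum);
        }
        return counts;
    }

    // cos(theta) = (a . b) / (|a| * |b|), over the shared word dimensions.
    static double cosine(Map<String, Integer> a, Map<String, Integer> b) {
        double dot = 0, normA = 0, normB = 0;
        for (Map.Entry<String, Integer> e : a.entrySet()) {
            Integer other = b.get(e.getKey());
            if (other != null) dot += e.getValue() * other;
            normA += e.getValue() * e.getValue();
        }
        for (int v : b.values()) normB += (double) v * v;
        if (normA == 0 || normB == 0) return 0; // guard against empty docs
        return dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }
}
```

A word that appears in only one document contributes nothing to the dot product, which is what pushes the cosine toward 0 for dissimilar texts.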

I've never used Hamming distance for this purpose, so I can't say much about it.

I would add TF-IDF (Term Frequency × Inverse Document Frequency) to the list. It's fairly similar to cosine distance, but (1) it tends to do a better job on shorter documents, and (2) it does a better job of taking into account which words are extremely common across an entire corpus, rather than just the ones that happen to be common to the two particular documents.
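A sketch of the weighting, assuming the common tf × log(N / df) variant (there are several); it reuses the word-count maps from the cosine sketch above, and you'd feed the resulting weights into the same cosine computation instead of the raw counts:

```java
import java.util.List;
import java.util.Map;

public class TfIdf {
    // idf: log(total documents / documents containing the term).
    static double idf(String term, List<Map<String, Integer>> corpus) {
        long docsWithTerm = corpus.stream()
                .filter(doc -> doc.containsKey(term))
                .count();
        return Math.log((double) corpus.size() / Math.max(1, docsWithTerm));
    }

    // tf-idf weight of a term within one document of the corpus.
    static double tfidf(String term, Map<String, Integer> doc,
                        List<Map<String, Integer>> corpus) {
        return doc.getOrDefault(term, 0) * idf(term, corpus);
    }
}
```

Note how a word that appears in every document gets idf = log(1) = 0, so corpus-wide common words are weighted out automatically.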

One final note: for any of these to produce useful results, you nearly always need to screen out stop words before you try to compute the degree of similarity (though TF-IDF seems to cope better than the others if you skip this). At least in my experience, it's also extremely helpful to stem the words (remove suffixes). When I've done it, I used Porter's stemming algorithm.
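For illustration, a sketch of the stop-word screening step (the stop list below is a tiny sample, not a real one; a full Porter stemmer is too long to inline here, and ready-made Java implementations of it exist):

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Set;

public class Preprocess {
    // Illustrative sample only; real stop lists run to a few hundred words.
    static final Set<String> STOP_WORDS =
            Set.of("the", "a", "an", "and", "or", "of", "to", "in", "is");

    // Word counts with stop words screened out; each surviving word would
    // also be stemmed (e.g. with Porter's algorithm) before counting.
    static Map<String, Integer> filteredCounts(String text) {
        Map<String, Integer> counts = new HashMap<>();
        for (String word : text.toLowerCase().split("\\W+")) {
            if (word.isEmpty() || STOP_WORDS.contains(word)) continue;
            counts.merge(word, 1, Integer::sum);
        }
        return counts;
    }
}
```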

For your purposes, you probably want to use what I've dubbed an inverted thesaurus: it lets you look up a word and substitute a single canonical word for each meaning. I tried this on one project and didn't find it as useful as expected, but it sounds like it would be considerably more useful for your project.
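A minimal sketch of the idea, with made-up entries; a real table would be populated from the synonym database the question mentions:

```java
import java.util.HashMap;
import java.util.Map;

public class InvertedThesaurus {
    // word -> single canonical word for that meaning (illustrative entries).
    private final Map<String, String> canonical = new HashMap<>();

    public InvertedThesaurus() {
        canonical.put("large", "big");
        canonical.put("huge",  "big");
        canonical.put("rapid", "fast");
        canonical.put("quick", "fast");
    }

    // Canonical form if known, otherwise the word itself.
    public String canonicalize(String word) {
        String w = word.toLowerCase();
        return canonical.getOrDefault(w, w);
    }
}
```

Canonicalizing every word before counting means two documents that differ only by synonym substitution produce (nearly) identical word vectors.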