file text information-retrieval similarity

How to compute similarity between two license.txt files?

I would like to compute the similarity between license text files so that, given a license.txt, I can identify which license it corresponds to. What kind of information retrieval technique should I use? I once implemented tf-idf, but I am not sure whether it is applicable here. What do you suggest?

Solution

I've been working on this issue for 3+ years, and let me tell you, it's far from trivial. You are not going to solve it with a single algorithm, let alone tf-idf and cosine similarity.

There are a number of challenges; here are some of them:

Similar license texts (AGPL/GPL/LGPL, BSD/Apache 1.1/OpenSSL, MIT/ISC/curl) are extremely difficult to disambiguate and have an extremely high cosine similarity (unless you are very smart about feature selection, maybe...).
The same applies to different versions of the same license (LGPL 2.0/2.1).
LICENSE.TXT files often contain multiple licenses.
BSD notices are very hard to catch, i.e. you have the same text except for the rights holder.
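
To make the cosine similarity problem concrete, here is a minimal sketch that compares plain term-frequency vectors. The three texts are made-up stand-ins, not real license texts, and the helper names (tokens, cosine) are just for illustration. Two notices that differ only in year and rights holder score very close to each other, which is why this measure alone cannot separate them.

```python
import math
import re
from collections import Counter


def tokens(text):
    """Lowercase the text and split it into alphanumeric word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def cosine(a, b):
    """Cosine similarity between the term-frequency vectors of two texts."""
    va, vb = Counter(tokens(a)), Counter(tokens(b))
    dot = sum(va[t] * vb[t] for t in va.keys() & vb.keys())
    norm_a = math.sqrt(sum(c * c for c in va.values()))
    norm_b = math.sqrt(sum(c * c for c in vb.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


# Made-up stand-ins for two notices that differ only in year and rights holder.
notice_a = ("Copyright (c) 2001 Alice Example. Redistribution and use in "
            "source and binary forms are permitted.")
notice_b = ("Copyright (c) 2014 Bob Sample. Redistribution and use in "
            "source and binary forms are permitted.")
# Made-up stand-in for an unrelated text.
other = ("This program is free software; you can redistribute it under "
         "the terms of this license.")

print(round(cosine(notice_a, notice_b), 3))  # high: only a few tokens differ
print(round(cosine(notice_a, other), 3))     # low: almost no shared tokens
```

On full-length license texts the shared boilerplate dominates even more, so the gap between "same license, different holder" and "sibling license" shrinks further.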

You will end up using a combination of approaches; unfortunately, there's no silver bullet.
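
As one possible illustration of combining steps (an assumption on my part, not a complete or proven recipe): strip the copyright line so the rights holder does not matter, compare token sequences in order with difflib instead of a bag of words, and refuse to answer below a threshold. The TEMPLATES entries, their names, and the 0.9 threshold are all hypothetical placeholders.

```python
import difflib
import re

# Hypothetical reference texts; a real setup would load full license templates.
TEMPLATES = {
    "license-x-1.0": "Permission is granted to use, copy and modify this "
                     "software for any purpose.",
    "license-x-2.0": "Permission is granted to use, copy and modify this "
                     "software for any purpose, provided this notice is kept.",
}


def normalize(text):
    """Drop copyright lines (they vary per rights holder) and tokenize the rest."""
    kept = [ln for ln in text.splitlines()
            if not ln.strip().lower().startswith("copyright")]
    return re.findall(r"[a-z0-9]+", " ".join(kept).lower())


def best_match(text, threshold=0.9):
    """Return (name, score) for the closest template, or (None, score) if below threshold."""
    words = normalize(text)
    scored = []
    for name, template in TEMPLATES.items():
        # Order-sensitive comparison of token sequences.
        matcher = difflib.SequenceMatcher(None, words, normalize(template),
                                          autojunk=False)
        scored.append((matcher.ratio(), name))
    score, name = max(scored)
    return (name if score >= threshold else None, score)


candidate = ("Copyright 2020 Some Holder\n"
             "Permission is granted to use, copy and modify this software "
             "for any purpose, provided this notice is kept.")
print(best_match(candidate))
```

This still does not handle files that contain several licenses at once; that case needs the file to be split or scanned in windows before matching.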