I would like to compute the similarity between license text files so that, given a license.txt, I can identify which license it corresponds to. Which information retrieval technique should I use? I once implemented tf-idf, but I am not sure whether it is applicable here. What do you suggest?
I've been working on this problem for 3+ years. Let me tell you, it's far from trivial, and you are not going to solve it with a single algorithm, let alone with just tf-idf and cosine similarity.
There are a number of challenges; here are some of them:
You will end up using a combination of approaches; unfortunately, there's no silver bullet.
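For reference, here is roughly what the tf-idf plus cosine similarity baseline from the question looks like. It is a minimal sketch using only the Python standard library; the function names (`tokenize`, `build_idf`, `rank_licenses`) and the two reference snippets are made up for illustration and are not real license texts. Treat it as one ingredient to combine with other approaches, not as a solution on its own.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_idf(tokenized_docs):
    """Compute a smoothed inverse document frequency for every known term."""
    n_docs = len(tokenized_docs)
    doc_freq = Counter()
    for tokens in tokenized_docs.values():
        doc_freq.update(set(tokens))
    return {
        term: math.log((1 + n_docs) / (1 + count)) + 1.0
        for term, count in doc_freq.items()
    }


def tfidf_vector(tokens, idf):
    """Build a sparse tf-idf vector; terms unseen in the references are dropped."""
    counts = Counter(tokens)
    return {term: count * idf[term] for term, count in counts.items() if term in idf}


def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(weight * b.get(term, 0.0) for term, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank_licenses(unknown_text, references):
    """Return (name, score) pairs for the references, best match first."""
    tokenized = {name: tokenize(text) for name, text in references.items()}
    idf = build_idf(tokenized)
    query = tfidf_vector(tokenize(unknown_text), idf)
    scores = [
        (name, cosine(query, tfidf_vector(tokens, idf)))
        for name, tokens in tokenized.items()
    ]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)


# Toy reference corpus: invented snippets, NOT actual license texts.
REFERENCES = {
    "permissive-toy": "permission is hereby granted free of charge to any person obtaining a copy",
    "copyleft-toy": "you may redistribute this program under the terms of the general public license",
}

ranking = rank_licenses("Permission is hereby granted, free of charge, to any person", REFERENCES)
for name, score in ranking:
    print(f"{name}: {score:.3f}")
```

The top-ranked name is only a candidate: a high score says the word distributions are similar, nothing more, so you would still need other checks on top of it before calling it a match.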