Remove duplicating docs of docs with high similarity

When downloading lexisnexis newspaper articles, there's often a lot of duplicating articles in the corpus. I want to remove them and I was thinking of doing so by using cosine similarity statistics, but I'm not sure how to automate this. Any ideas?

Solution

You already received some excellent answers. But if you prefer a more automated approach targeted at your specific use case, you can use the package LexisNexisTools (which I wrote). It comes with a function called lnt_similarity(), which does exactly what you were looking for. I wrote a quick tutorial with mock data here.

The main difference between the solutions here and in lnt_similarity() is that I also take into account word order, which can make a big difference in some rare cases (see this blog post).

I also suggest you think carefully about thresholds as you might otherwise remove some articles wrongfully. I included a function to visualize the difference between two articles so you can get a better grip of the data you are removing called lnt_diff().