r quanteda

Remove duplicate documents, or documents with high similarity


When downloading LexisNexis newspaper articles, there are often a lot of duplicate articles in the corpus. I want to remove them and was thinking of doing so using cosine similarity, but I'm not sure how to automate this. Any ideas?


Solution

  • You have already received some excellent answers, but if you prefer a more automated approach targeted at your specific use case, you can use the package LexisNexisTools (which I wrote). It comes with a function called lnt_similarity(), which does exactly what you are looking for. I wrote a quick tutorial with mock data here.

    The main difference between the other solutions here and lnt_similarity() is that I also take word order into account, which can make a big difference in some rare cases (see this blog post).

    I also suggest you think carefully about thresholds, as you might otherwise wrongly remove some articles. I included a function called lnt_diff(), which visualizes the difference between two articles so you can get a better grip on the data you are removing.
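
As a rough illustration of the threshold and word-order points above, here is a minimal sketch using quanteda and quanteda.textstats rather than lnt_similarity() itself. The toy documents, the 0.95 threshold, and the variable names are assumptions made for the example. It is not how LexisNexisTools works internally. Documents "a" and "b" use the same words in a different order, so a bag-of-words cosine measure treats them as near-identical.

```r
library(quanteda)
library(quanteda.textstats)

# Toy articles (invented for illustration): "a" and "b" contain the same
# words in a different order, "c" is unrelated.
docs <- c(
  a = "The council approved the budget after a long debate",
  b = "After a long debate the council approved the budget",
  c = "Local team wins the regional football final"
)

# Bag-of-words document-feature matrix (word order is discarded here).
dfmat <- dfm(tokens(docs, remove_punct = TRUE))

# Pairwise cosine similarity between documents.
sim <- textstat_simil(dfmat, method = "cosine", margin = "documents")

# Long format: one row per document pair, with the columns
# document1, document2 and cosine.
sim_pairs <- as.data.frame(sim)

# Keep only pairs at or above the threshold (assumed value, tune on your data).
threshold <- 0.95
sim_pairs <- sim_pairs[sim_pairs$cosine >= threshold, ]

# Drop one member of every flagged pair and keep the rest.
to_drop <- unique(as.character(sim_pairs$document2))
kept <- docs[!names(docs) %in% to_drop]

print(sim_pairs)
print(names(kept))
```

On a large corpus, the min_simil argument of textstat_simil() can discard low-similarity pairs during the computation, so the full pairwise matrix is never stored. Before deleting anything, it is worth reading a sample of the flagged pairs to check that the threshold is not catching genuinely different articles.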