algorithm, similarity

Detect duplicate/similar text in large datasets?


I have a large database with thousands of records. Every time a user posts their information, I need to know whether the same or a similar record already exists. Are there any algorithms or open-source implementations that solve this problem?

The text is in Chinese, and by 'similar' I mean that the records share most of their content, roughly 80%-100% identical. Each record is fairly small, about 2k-6k bytes.


Solution

  • http://d3s.mff.cuni.cz/~holub/sw/shash/

  • http://matpalm.com/resemblance/simhash/
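
To give a feel for the simhash idea behind the second link, here is a minimal Python sketch. It is not the code from either linked page, only an illustration under some assumptions: the helper names (`shingles`, `hash64`, `simhash`, `hamming`), the 3-character shingle size, the 64-bit fingerprint, and the use of a truncated MD5 as the per-shingle hash are all my own choices. Character n-grams are used because they need no word segmentation, which is convenient for Chinese text.

```python
import hashlib


def shingles(text, n=3):
    """Return overlapping character n-grams (whitespace removed first)."""
    text = "".join(text.split())
    if len(text) <= n:
        return [text] if text else []
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def hash64(s):
    """Hash a string to a 64-bit integer (first 8 bytes of MD5; an assumption)."""
    return int.from_bytes(hashlib.md5(s.encode("utf-8")).digest()[:8], "big")


def simhash(text, n=3, bits=64):
    """Build a fingerprint: each shingle votes +1/-1 on every bit position."""
    counts = [0] * bits
    for sh in shingles(text, n):
        h = hash64(sh)
        for b in range(bits):
            counts[b] += 1 if (h >> b) & 1 else -1
    fingerprint = 0
    for b in range(bits):
        if counts[b] > 0:  # keep the bit only if the votes are net positive
            fingerprint |= 1 << b
    return fingerprint


def hamming(a, b):
    """Number of differing bits between two fingerprints."""
    return bin(a ^ b).count("1")


if __name__ == "__main__":
    a = "我有一个很大的数据库，里面有几千条记录，每次用户提交信息都要检查是否重复。"
    b = "我有一个很大的数据库，里面有几千条记录，每次用户提交资料都要检查是否重复。"
    print(hamming(simhash(a), simhash(b)))  # small distance suggests near-duplicates
```

Records whose fingerprints differ in only a few bits are likely near-duplicates; the exact cutoff is something you would have to tune on your own data. You can store one fingerprint per record and compare a new post against the stored fingerprints instead of the full text.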