Detect duplicated/similar text among large datasets?

I have a large database with thousands of records. Every time a user posts their information, I need to know whether the same or a similar record already exists. Are there any algorithms or open-source implementations that solve this problem?

The text is in Chinese, and by 'similar' I mean that most of the content is identical, roughly 80%-100% the same. Each record is fairly small, about 2k-6k bytes.

Solution

http://d3s.mff.cuni.cz/~holub/sw/shash/

http://matpalm.com/resemblance/simhash/
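
As a rough illustration of the simhash idea that the second link refers to, here is a minimal sketch in Python. It is not taken from either linked page. The function names, the character 3-gram shingling (chosen so that Chinese text needs no word segmenter), the use of MD5 as the per-shingle hash, and the 64-bit fingerprint size are all assumptions made for this example.

```python
import hashlib


def shingles(text, n=3):
    """Return overlapping character n-grams of the text, ignoring whitespace."""
    text = "".join(text.split())
    if len(text) < n:
        return [text] if text else []
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def simhash(text, n=3, bits=64):
    """Compute a simhash fingerprint (bits <= 64) from character n-grams."""
    counts = [0] * bits
    for sh in shingles(text, n):
        # Stable 64-bit hash of the shingle (assumption: first 8 bytes of MD5).
        h = int.from_bytes(hashlib.md5(sh.encode("utf-8")).digest()[:8], "big")
        for b in range(bits):
            # Each shingle votes +1 or -1 on every bit position.
            counts[b] += 1 if (h >> b) & 1 else -1
    fingerprint = 0
    for b in range(bits):
        if counts[b] > 0:
            fingerprint |= 1 << b
    return fingerprint


def hamming(a, b):
    """Number of bit positions in which two fingerprints differ."""
    return bin(a ^ b).count("1")
```

The fingerprint of each stored record can be computed once and kept in the database. When a new record arrives, its fingerprint is compared with the stored ones, and a small Hamming distance suggests a near-duplicate. The distance threshold that corresponds to "80%-100% the same" is not something this sketch determines; it would have to be tuned on real data, and candidates found this way can be confirmed with an exact comparison.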
