I'm developing a document system that, each time a new document is created, has to detect and discard duplicates in a database of about 500,000 records.
For now, I use a search engine to retrieve the 20 most similar documents and compare each of them with the new one. The problem is that I have to check whether the new document is similar to an existing one (that's easy with similar_text), or whether it's contained inside another text, and in both cases the text may have been partly changed by the user (this is the hard part). How can I do that?
For example:
<?php
$new = "the wild lion";
$candidates = array(
    // $new is contained in this one, but with 'wild' changed to 'dangerous'; it must be detected as a duplicate
    'the dangerous lion lives in Africa',
    'rhinoceros are native to Africa and three to southern Asia.'
);
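foreach ($candidates as $candidate) {
    // Pseudocode: is_duplicate() is the check I don't know how to write.
    // It should be true if $candidate is similar to $new, or if $new is contained in it.
    if (is_duplicate($new, $candidate)) {
        // Duplicate!
    }
}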
Of course, in my system the documents are longer than 3 words :)
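To make the problem concrete, here is a minimal sketch of why similar_text() on the whole texts isn't enough for the containment case, together with one idea I'm considering: sliding a window of the same word count over the candidate and keeping the best score. The name bestWindowSimilarity() and the whitespace-only word splitting are assumptions for illustration, not something I have in production.

```php
<?php
// Sketch only: bestWindowSimilarity() is a hypothetical helper, not a PHP built-in.
// It slides a window of count($newWords) words over $candidate and returns the
// highest similar_text() percentage found against $new.
function bestWindowSimilarity(string $new, string $candidate): float
{
    $newWords = preg_split('/\s+/u', trim($new), -1, PREG_SPLIT_NO_EMPTY);
    $candWords = preg_split('/\s+/u', trim($candidate), -1, PREG_SPLIT_NO_EMPTY);
    $size = count($newWords);
    if ($size === 0 || count($candWords) === 0) {
        return 0.0;
    }
    $needle = implode(' ', $newWords);
    $best = 0.0;
    // If the candidate is shorter than $new, it is compared as a single window.
    $last = max(0, count($candWords) - $size);
    for ($i = 0; $i <= $last; $i++) {
        $window = implode(' ', array_slice($candWords, $i, $size));
        similar_text($needle, $window, $percent);
        $best = max($best, $percent);
    }
    return $best;
}

$new = 'the wild lion';
$candidate = 'the dangerous lion lives in Africa';

// Whole-text comparison: the length difference alone keeps the score low.
similar_text($new, $candidate, $whole);
// Windowed comparison: only same-sized slices of the candidate are compared.
$windowed = bestWindowSimilarity($new, $candidate);

printf("whole: %.1f%%, best window: %.1f%%\n", $whole, $windowed);
```

The windowed score comes out higher than the whole-text score for this pair, but it still depends on choosing a threshold, and calling similar_text() once per window could get expensive on long documents, so I'm not sure this scales to my case.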
This is the temporary solution I'm using:
function contained($text1, $text2, $factor = 0.9) {
    // Split into lowercase words, dropping surrounding punctuation
    $pattern = '/((^\p{P}+)|(\p{P}*\s+\p{P}*)|(\p{P}+$))/u';
    $words1 = preg_split($pattern, mb_strtolower($text1), -1, PREG_SPLIT_NO_EMPTY);
    $words2 = preg_split($pattern, mb_strtolower($text2), -1, PREG_SPLIT_NO_EMPTY);
    // Decide which text is the long one and which is the short one
    if (count($words1) > count($words2)) {
        $long = $words1;
        $short = $words2;
    } else {
        $long = $words2;
        $short = $words1;
    }
    // Avoid a division by zero when one of the texts has no words
    if (count($short) === 0) {
        return false;
    }
    // Count the words of the short text that also appear in the long one
    $count = 0;
    foreach ($short as $word) {
        if (in_array($word, $long)) {
            $count++;
        }
    }
    // Duplicate if the share of matching words exceeds the factor
    return ($count / count($short)) > $factor;
}
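A quick way to see the limits of this temporary solution is to look at the raw ratio instead of the boolean. The sketch below uses overlapRatio(), a hypothetical helper that mirrors the counting in contained() with the same splitting pattern (the array_flip lookup is my own substitution for in_array). It suggests that my own example only reaches 2/3, below the default factor of 0.9, and that word order is ignored completely.

```php
<?php
// Hypothetical helper mirroring contained(): returns the fraction of words of the
// shorter text that also occur in the longer one (0.0 when a text has no words).
function overlapRatio(string $text1, string $text2): float
{
    $pattern = '/((^\p{P}+)|(\p{P}*\s+\p{P}*)|(\p{P}+$))/u';
    $words1 = preg_split($pattern, mb_strtolower($text1), -1, PREG_SPLIT_NO_EMPTY);
    $words2 = preg_split($pattern, mb_strtolower($text2), -1, PREG_SPLIT_NO_EMPTY);
    [$short, $long] = count($words1) <= count($words2)
        ? [$words1, $words2]
        : [$words2, $words1];
    if (count($short) === 0) {
        return 0.0;
    }
    // array_flip gives keyed lookups instead of scanning $long with in_array().
    $lookup = array_flip($long);
    $count = 0;
    foreach ($short as $word) {
        if (isset($lookup[$word])) {
            $count++;
        }
    }
    return $count / count($short);
}

// 'the' and 'lion' match, 'wild' does not: 2 of 3 words.
$partial = overlapRatio('the wild lion', 'the dangerous lion lives in Africa');
// Same words in a different order: every word still matches.
$shuffled = overlapRatio('lion wild the', 'the wild lion lives in Africa');
// No word in common with the second candidate.
$unrelated = overlapRatio('the wild lion', 'rhinoceros are native to Africa and three to southern Asia.');

printf("partial: %.2f, shuffled: %.2f, unrelated: %.2f\n", $partial, $shuffled, $unrelated);
```

So with the default factor the 'dangerous lion' candidate would not be flagged, while a reordered text with the same words would be, which is why I consider this only a stopgap.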