Tags: php, text, levenshtein-distance

How to check if a text is contained in another?


I'm developing a document system that, each time a new document is created, has to detect and discard duplicates in a database of about 500,000 records.

For now, I'm using a search engine to retrieve the 20 most similar documents and compare them with the new one we're trying to create. The problem is that I have to check whether the new document is similar to an existing one (that's easy with similar_text), or whether it is contained inside another text, bearing in mind that the text may have been partly changed by the user (this is the hard part). How can I do that?

For example:

<?php

$new = "the wild lion";

$candidates = array(
  // $new is contained in this one, but 'wild' was changed to 'dangerous'; it must be detected as a duplicate
  'the dangerous lion lives in Africa',
  'rhinoceros are native to Africa and three to southern Asia.'
);

foreach ( $candidates as $candidate ) {
  // Pseudocode: this condition is the check I don't know how to write
  if( $candidate is similar or $new is contained in it) {
       // Duplicate!
  }
}

Of course, in my system the documents are longer than 3 words :)


Solution

  • This is the temporary solution I'm using:

    function contained($text1, $text2, $factor = 0.9) {
        // Split both texts into lowercase words, dropping punctuation and whitespace
        $pattern = '/((^\p{P}+)|(\p{P}*\s+\p{P}*)|(\p{P}+$))/u';
        $words1 = preg_split($pattern, mb_strtolower($text1), -1, PREG_SPLIT_NO_EMPTY);
        $words2 = preg_split($pattern, mb_strtolower($text2), -1, PREG_SPLIT_NO_EMPTY);

        // Decide which text is the long one and which is the short one
        if (count($words1) > count($words2)) {
            $long = $words1;
            $short = $words2;
        } else {
            $long = $words2;
            $short = $words1;
        }

        // An empty text has no words to look for (and would divide by zero below)
        if (count($short) === 0) {
            return false;
        }

        // Count how many words of the short text also appear in the long one;
        // array_flip turns the word list into keys for fast lookups
        $longSet = array_flip($long);
        $count = 0;
        foreach ($short as $word) {
            if (isset($longSet[$word])) {
                $count++;
            }
        }

        return ($count / count($short)) > $factor;
    }
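
    The function above only counts exact word matches, so a single typo makes a word count as missing. Since the question is tagged levenshtein-distance, here is a rough sketch of a fuzzy variant that treats two words as equal when they are within a small edit distance. The names tokenize, fuzzyContained and $maxDistance are my own assumptions and not part of the original solution, and the code assumes PHP 7.1 or later. It is only a starting point, not a tuned implementation.

```php
<?php

// Hypothetical helper: lowercases a text and splits it into words using the
// same punctuation/whitespace pattern as contained().
function tokenize(string $text): array {
    $pattern = '/((^\p{P}+)|(\p{P}*\s+\p{P}*)|(\p{P}+$))/u';
    return preg_split($pattern, mb_strtolower($text), -1, PREG_SPLIT_NO_EMPTY);
}

// Hypothetical variant of contained(): a word of the shorter text counts as
// found when some word of the longer text is within $maxDistance edits of it.
function fuzzyContained(string $text1, string $text2, float $factor = 0.9, int $maxDistance = 1): bool {
    $words1 = tokenize($text1);
    $words2 = tokenize($text2);
    [$short, $long] = count($words1) > count($words2)
        ? [$words2, $words1]
        : [$words1, $words2];

    if (count($short) === 0) {
        return false; // nothing to compare
    }

    $long = array_unique($long); // no need to test the same word twice
    $count = 0;
    foreach ($short as $word) {
        foreach ($long as $candidate) {
            // levenshtein() works on bytes, so a multibyte character
            // may count as more than one edit
            if (levenshtein($word, $candidate) <= $maxDistance) {
                $count++;
                break;
            }
        }
    }

    return ($count / count($short)) > $factor;
}

// The typo 'lyon' is one edit away from 'lion', so all three words match
var_dump(fuzzyContained('the wild lion', 'the wild lyon lives in Africa'));

// The example from the question: only 2 of 3 words match, so the factor must be lowered
var_dump(fuzzyContained('the wild lion', 'the dangerous lion lives in Africa', 0.6));

// Unrelated text: no word of the short text is within one edit of any word here
var_dump(fuzzyContained('the wild lion', 'rhinoceros are native to Africa and three to southern Asia.', 0.6));
```

    Note that the example from the question ('wild' replaced by 'dangerous') matches only 2 of 3 words, which is about 0.67, so it is not flagged with the default factor of 0.9 in either version. An edit distance of 1 is also very permissive for short words ('the' would match 'she'), so in practice the allowed distance probably has to depend on the word length.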