php algorithm levenshtein-distance

Calculating the distance between two articles accurately


I am writing software to compare articles, and I am looking for an efficient and accurate algorithm to calculate the difference (variation) between two articles. The variation should depend entirely on words, not letters. I have tried levenshtein(), but its time complexity is O(n*m), which is very expensive on big texts like an article. I have also tried similar_text(), whose time complexity is even higher: O(N^3), where N is the length of the longer string. Moreover, both functions work on characters: levenshtein() counts the character edits needed to transform one string into the other, and similar_text() counts matching characters. Neither is an accurate way to measure the difference between two big articles.

What other options do I have?


EDIT:

I am trying to approximate the variation as a search engine (such as Google) would see it.


Solution

  • In my case, I needed to calculate the variation between two articles, and a very simple solution worked well for me. It computes the similarity as the number of distinct words the two articles have in common, divided by max(number of distinct words in article A, number of distinct words in article B), expressed as a percentage. The variation is then 100 minus the similarity. The code below shows the calculation.

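    function get_variation($article1, $article2) {
        // Split on runs of non-word characters; PREG_SPLIT_NO_EMPTY drops the
        // empty tokens produced by leading or trailing punctuation.
        $wordsA = array_unique(preg_split('@\W+@', $article1, -1, PREG_SPLIT_NO_EMPTY));
        $wordsB = array_unique(preg_split('@\W+@', $article2, -1, PREG_SPLIT_NO_EMPTY));
        $largest = max(count($wordsA), count($wordsB));
        if ($largest === 0) {
            return 0.0; // Neither article contains a word, so there is nothing to compare.
        }
        // Similarity: shared distinct words as a percentage of the larger word set.
        $intersection = array_intersect($wordsA, $wordsB);
        $similarity = count($intersection) / $largest * 100;
        // Variation is the complement of the similarity, rounded to two decimals.
        return round(100 - $similarity, 2);
    }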
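
As a rough check of the formula, here is a minimal self-contained sketch that runs the same word-overlap calculation on two short strings. The function name word_variation_sketch and the sample sentences are made up for illustration, and treating two empty texts as identical is an assumption, not part of the original answer.

```php
<?php
// Hypothetical helper; it follows the word-overlap idea described in the answer.
function word_variation_sketch(string $a, string $b): float
{
    // Split on runs of non-word characters and drop empty tokens.
    $wordsA = array_unique(preg_split('/\W+/', $a, -1, PREG_SPLIT_NO_EMPTY));
    $wordsB = array_unique(preg_split('/\W+/', $b, -1, PREG_SPLIT_NO_EMPTY));

    $largest = max(count($wordsA), count($wordsB));
    if ($largest === 0) {
        return 0.0; // Assumption: two texts without any words count as identical.
    }

    // Shared distinct words, relative to the larger of the two word sets.
    $shared = count(array_intersect($wordsA, $wordsB));
    return round(100 - ($shared / $largest) * 100, 2);
}

// A has 4 distinct words, B has 5, and they share 3 ("The", "quick", "fox"),
// so similarity is 3 / 5 = 60% and variation is 100 - 60.
echo word_variation_sketch('The quick brown fox', 'The quick red fox jumps'), "\n";

// Texts with no words in common.
echo word_variation_sketch('alpha beta', 'gamma delta'), "\n";
```

Note that the comparison is case-sensitive ("The" and "the" count as different words) and ignores how often a word occurs. Depending on the articles, lowercasing both texts first with strtolower() may give a more useful number.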