So, suppose I have a simple array of sentences. What would be the best way to search it based on user input, and return the closest match?
The Levenshtein functions seem promising, but I don't think I want to use them. User input may be as simple as "highest mountain", in which case I'd want to search for the sentence in the array that contains "highest mountain". If that exact phrase does not exist, then I'd want to search for the sentence that contains "highest" AND "mountain", but not back-to-back, and so on. The Levenshtein functions work on a per-character basis, but what I really need is a per-word basis.
Of course, to some degree, Levenshtein functions may still be useful, as I'd also want to take into account the possibility of the sentence containing the phrase "highest mountains" (notice the S) or similar.
What do you suggest? Are there any systems for PHP that do this that already exist? Would Levenshtein functions alone be an adequate solution? Is there a word-based Levenshtein function that I don't know about?
Thanks!
EDIT - I have considered MySQL fulltext search, and I have also considered breaking both A) the input and B) each sentence into separate arrays of words and comparing them that way, using Levenshtein functions to account for variations in words (color, colour, colors, etc.). However, I am concerned that this method, though possibly clever, may be computationally taxing.
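For illustration, the word-array idea from the edit above could look roughly like the sketch below. The function names (tokenize, wordsMatch, scoreSentence, findClosest), the scoring weights, and the edit-distance threshold of 1 are all assumptions made for this example, not an existing library. Only PHP's built-in levenshtein() and preg_split() are used.

```php
<?php
// Sketch only: helper names, weights, and thresholds are hypothetical.

/** Lowercase a string and split it into alphanumeric words. */
function tokenize(string $text): array
{
    return preg_split('/[^a-z0-9]+/', strtolower($text), -1, PREG_SPLIT_NO_EMPTY);
}

/** Exact match, or a near match (edit distance <= $maxDistance) for words longer than 3 characters. */
function wordsMatch(string $a, string $b, int $maxDistance = 1): bool
{
    if ($a === $b) {
        return true;
    }
    // Assumption: very short words must match exactly, to avoid "the" ~ "she".
    if (min(strlen($a), strlen($b)) <= 3) {
        return false;
    }
    return levenshtein($a, $b) <= $maxDistance;
}

/** Score = fraction of query words found anywhere, plus a bonus for a back-to-back phrase. */
function scoreSentence(array $queryWords, string $sentence): float
{
    $words = tokenize($sentence);
    $n = count($queryWords);
    if ($n === 0) {
        return 0.0;
    }

    // Count query words that have an exact or near match anywhere in the sentence.
    $matched = 0;
    foreach ($queryWords as $q) {
        foreach ($words as $w) {
            if (wordsMatch($q, $w)) {
                $matched++;
                break;
            }
        }
    }

    // Phrase bonus: 1.0 if the words appear consecutively and exactly, 0.5 if only nearly.
    $bonus = 0.0;
    for ($i = 0; $i + $n <= count($words); $i++) {
        $run = true;
        $exact = true;
        for ($j = 0; $j < $n; $j++) {
            if (!wordsMatch($queryWords[$j], $words[$i + $j])) {
                $run = false;
                break;
            }
            if ($queryWords[$j] !== $words[$i + $j]) {
                $exact = false;
            }
        }
        if ($run) {
            $bonus = max($bonus, $exact ? 1.0 : 0.5);
        }
    }

    return $matched / $n + $bonus;
}

/** Return the best-scoring sentence, or null if no query word matches anything. */
function findClosest(string $query, array $sentences): ?string
{
    $queryWords = tokenize($query);
    $best = null;
    $bestScore = 0.0;
    foreach ($sentences as $sentence) {
        $score = scoreSentence($queryWords, $sentence);
        if ($score > $bestScore) {
            $best = $sentence;
            $bestScore = $score;
        }
    }
    return $best;
}

$sentences = [
    'The mountain with the highest peak is in Asia.',
    'Everest is among the highest mountains on Earth.',
    'K2 is not the highest mountain, but it is steep.',
    'Rivers flow to the sea.',
];

echo findClosest('highest mountain', $sentences), "\n";
// prints: K2 is not the highest mountain, but it is steep.
```

On the cost concern: this does roughly (sentences x query words x words per sentence) short levenshtein() calls per search, which is likely fine for a small in-memory array. For a large collection, an index such as MySQL fulltext or Lucene would probably scale better.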
Check this: http://framework.zend.com/manual/en/zend.search.lucene.overview.html
Zend_Search_Lucene offers an HTML parsing feature. Documents can be created directly from an HTML file or string:
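// Assumes $index is an already created or opened Zend_Search_Lucene index
// and $htmlString holds the HTML to be indexed.
$doc = Zend_Search_Lucene_Document_Html::loadHTML($htmlString);
$index->addDocument($doc);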