I am having trouble trying to write a search engine that treats all inflections of a word as the same basic word.
subjunctives (e.g. might be; critical that something be finished; I wish it were)
For example, with “enable”, I don’t want “enables” and “enabled” printed as separate entries. All three of those should count as the same basic word, the verb enable.
I can prevent printing of duplicates using a hash like:
unless ($seenmatches{ $headmatches[$l] }++)
Could someone explain this? Explained in comments below.
This doesn’t stop the plural/past from continuing on. Is there a way to do this, or some wholely distinct approach, perhaps one involving a regex and/or substitution, then an unsub later?
I can't modify the word with a substitution, because then the print would not print out right. Although I'm not at the stage yet, eventually I'd like to include irregular past tenses [ᴇᴅɪᴛᴏʀ's ɴᴏᴛᴇ: and irregular nouns, too?] as well
Im not sure what else you need to answer my question, so please just let me know anything I’ve unintentionally left out, and I'll fill in any missing bits to help make it clearer.
The way a typical search engine works is as follows:
Lingua::Stem
(or, better, Lingua::Stem::Snowball
) which are slightly updated versions of the Porter stemmerNow, when a query arrives, it too is tokenized and each token stemmed, but we don't care about positions this time. We look up each token against those we have indexed, to locate the postings (matching document identifiers). We can now retrieve the stored start/end offsets to determine where the terms were in the original text.
So, you do lose the suffixes for the index (which is what it used to locate matching documents) but you preserve the original text and offsets for those documents, so you can do query highlighting and nice display stuff should you need to.
Stemming is definitely the right tool for this job. The main trick is to ensure you treat the query and the documents in the same way. You can modify the original document, but really, you want to transform it into something like a back of book index, not into a string you use regular expressions on -- if you really are doing search engine stuff, that is. Check out the excellent KinoSearch
module on CPAN if you like, or look at the Apache Lucene project it was originally derived from.