Tags: perl, search, nlp, stemming, lemmatization

How to find basic, uninflected word for searching?


I am having trouble trying to write a search engine that treats all inflections of a word as the same basic word.

  1. So for verbs, these should all count as the same root word, be:
    • number/person (e.g. am; is; are)
    • tense/mood like past or future tense (e.g. was; were; will be)
    • past participles (e.g. has been; had been)
    • present participles and gerunds (e.g. is being; wasn't being funny; being early is less important than being correct)
    • subjunctives (e.g. might be; critical that something be finished; I wish it were)

  2. Then for nouns, both the singular form and the plural form should count as the same basic word [ᴇᴅɪᴛᴏʀ's ɴᴏᴛᴇ: this is frequently referred to as the citation form of the word.]

For example, with “enable”, I don’t want “enables” and “enabled” printed as separate entries. All three of those should count as the same basic word, the verb enable.

I can prevent printing of duplicates using a hash like:

# True only the first time this exact string has been seen:
unless ($seenmatches{ $headmatches[$l] }++)
  1. Could someone explain this? Explained in the comments below; see also the sketch after this list.

  2. This doesn't stop the plural/past forms from printing as separate entries, though. Is there a way to do this, or some wholly distinct approach, perhaps one involving a regex and/or substitution, then undoing the substitution later?
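
To illustrate item 1, here is a minimal, runnable sketch of that seen-hash idiom; the variable names come from the question, but the sample data is invented:

    #!/usr/bin/perl
    use strict;
    use warnings;

    # Sample matches, including exact duplicates:
    my @headmatches = qw(enable enables enable enabled enables);

    my %seenmatches;
    for my $l ( 0 .. $#headmatches ) {
        # The post-increment returns the old count: 0 (false) the
        # first time a string is seen, so the block runs once per
        # distinct string, and the count goes up either way.
        unless ( $seenmatches{ $headmatches[$l] }++ ) {
            print "$headmatches[$l]\n";
        }
    }

This prints enable, enables, and enabled once each: exact duplicates are suppressed, but the inflected forms still appear as separate entries, which is precisely the problem item 2 describes.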

I can't modify the word with a substitution, because then the matches would no longer print out in their original form. Although I'm not at that stage yet, eventually I'd like to include irregular past tenses [ᴇᴅɪᴛᴏʀ's ɴᴏᴛᴇ: and irregular nouns, too?] as well.
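
Since suffix-stripping alone can never recover irregular forms, one common approach is an exception table consulted before any stemming. A hedged sketch follows; the %lemma_of table and the normalize() helper are invented for illustration, and a real table would be far larger:

    use strict;
    use warnings;

    # Tiny illustrative table mapping irregular inflections to
    # their citation form:
    my %lemma_of = (
        am   => 'be', is   => 'be', are  => 'be',
        was  => 'be', were => 'be', been => 'be',
        went => 'go', mice => 'mouse',
    );

    sub normalize {
        my ($word) = @_;
        my $lc = lc $word;
        # Fall back to the word itself (or, later, to a stemmer):
        return $lemma_of{$lc} // $lc;
    }

    print normalize('Were'), "\n";   # be
    print normalize('mice'), "\n";   # mouse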

I'm not sure what else you need to answer my question, so please let me know anything I've unintentionally left out, and I'll fill in the missing bits to make it clearer.


Solution

  • The way a typical search engine works is as follows:

    • The input string is tokenized, chopped up at word boundaries - a character offset start/end is associated with each token
    • Each token is then stemmed - I'd use Lingua::Stem (or, better, Lingua::Stem::Snowball), which provide slightly updated versions of the Porter stemmer; see the sketch after this list
    • Each token, with its original character offset start/end is retained and indexed, usually along with a copy of the original text, before it was tokenized. This is basically a table which associates the term text with its originating document (usually as an identifier)
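
    A minimal sketch of the stemming step, assuming Lingua::Stem::Snowball is installed from CPAN. The exact stem strings are an internal detail; what matters is that all the inflected forms from the question collapse to the same one:

        use strict;
        use warnings;
        use Lingua::Stem::Snowball;

        my $stemmer = Lingua::Stem::Snowball->new( lang => 'en' );

        # The question's example forms all reduce to one stem:
        my @words = qw(enable enables enabled enabling);
        my @stems = $stemmer->stem( \@words );

        print "@stems\n";   # e.g. enabl enabl enabl enabl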

    Now, when a query arrives, it too is tokenized and each token stemmed, but we don't care about positions this time. We look up each token against those we have indexed, to locate the postings (matching document identifiers). We can now retrieve the stored start/end offsets to determine where the terms were in the original text.
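
    Putting those pieces together, here is a small end-to-end sketch; the two-document corpus, the \w+ tokenizer, and the in-memory postings hash are all invented for illustration (a real engine would persist the index):

        use strict;
        use warnings;
        use Lingua::Stem::Snowball;

        my $stemmer = Lingua::Stem::Snowball->new( lang => 'en' );

        # Hypothetical corpus: document id => original text.
        my %docs = (
            1 => 'Enabling the feature enables new options.',
            2 => 'The option was never enabled.',
        );

        # Inverted index: stem => list of [doc_id, start, end] postings.
        my %index;
        for my $id ( sort keys %docs ) {
            # Tokenize at word boundaries, keeping character offsets:
            while ( $docs{$id} =~ /(\w+)/g ) {
                my ( $start, $end ) = ( $-[1], $+[1] );
                push @{ $index{ $stemmer->stem( lc $1 ) } },
                     [ $id, $start, $end ];
            }
        }

        # Query time: stem the query token the same way, look it up,
        # and use the stored offsets to recover the original term text.
        my $stem = $stemmer->stem( lc 'enabled' );
        for my $posting ( @{ $index{$stem} // [] } ) {
            my ( $id, $start, $end ) = @$posting;
            printf "doc %d, chars %d-%d: %s\n",
                   $id, $start, $end,
                   substr( $docs{$id}, $start, $end - $start );
        }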

    So, you do lose the suffixes in the index (which is what's used to locate matching documents), but you preserve the original text and offsets for those documents, so you can do query highlighting and nice display stuff should you need to.
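
    For the highlighting itself, a short continuation of the sketch above (the highlight() helper is invented): insert markers starting from the rightmost hit, so the offsets you haven't used yet stay valid:

        use strict;
        use warnings;

        # Wrap each [start, end) span in asterisks, right to left so
        # earlier offsets aren't shifted by the inserted markers.
        sub highlight {
            my ( $text, @spans ) = @_;
            for my $span ( sort { $b->[0] <=> $a->[0] } @spans ) {
                my ( $start, $end ) = @$span;
                substr( $text, $end,   0 ) = '*';
                substr( $text, $start, 0 ) = '*';
            }
            return $text;
        }

        print highlight( 'The option was never enabled.', [ 21, 28 ] ), "\n";
        # The option was never *enabled*.
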

    Stemming is definitely the right tool for this job. The main trick is to ensure you treat the query and the documents in the same way. You can modify the original document, but really, you want to transform it into something like a back of book index, not into a string you use regular expressions on -- if you really are doing search engine stuff, that is. Check out the excellent KinoSearch module on CPAN if you like, or look at the Apache Lucene project it was originally derived from.