Search code examples
c#algorithminformation-retrievalstemmingporter-stemmer

Porter stemmer algorithm in information-retrieval


I need to create simple search engine for my application. Let's simplify it to the following: we have some texts (a lot) and i need to search and show relevant results.

I've based on this great article extend some things and it works pretty well for me.

But i have problem with stemming words to terms. For example words "annotation", "annotations" etc. will be stemmed to "annot", but imagine you try search something, and you will see unexpected results:

  • "anno" - nothing
  • "annota" - nothing etc.

Only word "annot" will give relevant result. So, how should i improve my search to give expected results? Because "annot" contains "anno" and "annota" is slightly more than "annot". Using contains all the time obviously isn't the solution

If in first case i can use some Ternary search tree, in second case i don't know what to do.

Any ideas would be very helpful.

UPDATE

oleksii has pointed me to n-grams here, which may works for me, but i don't know how to properly index n-grams.

So the Question:

  • Which data structure would be the best for my needs
  • How properly index my n-grams

Solution

  • Stemming perhaps isn't much relevant here. Stemming will convert a plural to a singular form.

    Given you have a tokeniser, a stemmer and a cleaner (to remove stop words, perhaps punctuation and numbers, short words etc) what you are looking at is a full-text search. I would advice you to get an off-the-shelf solution (like Elasticsearch, Lucene, Solr), but if you fancy a DIY approach I can suggest the following naive implementation.

    Step 1
    Create a search-orientated tokeniser. One example would be an n-gram tokeniser. It will take your word and split into the following sequences:

    annotation
    1 - [a, n, o, t, a, i]
    2 - [an, nn, no, ot, ...]
    3 - [ann, nno, not, ota, ...]
    4 - [anno, nnot, nota, otat, ...]
    ....
    

    Step 2
    Sort n-grams for more efficient look-up

    Step 3
    Search n-grams for exact match using binary search