c++ nlp porter-stemmer

Confusion about the Porter stemming algorithm


I am trying to implement the Porter stemming algorithm, but I got stuck at this passage from its description:

where the square brackets denote arbitrary presence of their contents. Using (VC){m} to denote VC repeated m times, this may again be written as

[C](VC){m}[V].

m will be called the \measure\ of any word or word part when represented in this form. The case m = 0 covers the null word. Here are some examples:

m=0    TR,  EE,  TREE,  Y,  BY.
m=1    TROUBLE,  OATS,  TREES,  IVY.
m=2    TROUBLES,  PRIVATE,  OATEN,  ORRERY.

I don't understand what this "measure" is or what it stands for.


Solution

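  • The measure is the number of times a run of vowels is immediately followed by a run of consonants, i.e. the number of (VC) groups in the word. Here V and C each stand for one or more consecutive vowels or consonants, not a single letter. For example,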

    "TROUBLES" has:

    Optional initial consonants [C] = "TR".

    First vowels-consonants group (VC) = "OUBL".

    Second vowels-consonants group (VC) = "ES".

    Optional ending vowels [V] is empty.

    So the measure is two, the number of times (VC) was "matched".
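The counting described above can be sketched in C++ as follows. This is only a minimal illustration, assuming Porter's usual vowel rule (A, E, I, O, U are vowels, and Y counts as a vowel only when the preceding letter is a consonant). The function names `isConsonant` and `measure` are made up for this example and are not taken from any particular stemmer implementation.

```cpp
#include <cctype>
#include <cstddef>
#include <string>

// Assumed vowel rule: A, E, I, O, U are vowels; Y is a vowel only when
// the letter before it is a consonant. Everything else is a consonant.
bool isConsonant(const std::string& w, std::size_t i) {
    switch (std::tolower(static_cast<unsigned char>(w[i]))) {
        case 'a': case 'e': case 'i': case 'o': case 'u':
            return false;
        case 'y':
            // A leading Y is a consonant; otherwise it is the opposite
            // of the letter that precedes it.
            return i == 0 ? true : !isConsonant(w, i - 1);
        default:
            return true;
    }
}

// Returns m in [C](VC){m}[V]: the number of places where a vowel is
// directly followed by a consonant, which equals the number of (VC) groups.
int measure(const std::string& w) {
    int m = 0;
    bool prevVowel = false;
    for (std::size_t i = 0; i < w.size(); ++i) {
        const bool cons = isConsonant(w, i);
        if (cons && prevVowel) {
            ++m;  // a vowel run just ended, closing one (VC) group
        }
        prevVowel = !cons;
    }
    return m;
}
```

With this sketch, `measure("TREE")` gives 0, `measure("TROUBLE")` gives 1, and `measure("TROUBLES")` gives 2, matching the examples quoted in the question.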