
In the Porter Stemming algorithm, what is the purpose of including an identity rule such as SS -> SS?


What is the point of the Porter Stemmer algorithm having a rule that converts SS to SS?


Solution

  • Within a single step, the Porter stemmer applies only one rule: the one whose suffix is the longest match for the word. Step 1a has four rules: SSES -> SS, IES -> I, SS -> SS and S -> (nothing). Imagine the rule SS->SS was not in the algorithm. Then a word like caress would fall through to the last rule, S -> (nothing), and be cut to cares. With the rule SS->SS the stemmer says: "I recognize the suffix ss and I replace it with ss. I'm done with this step." The rule does fictitious work, but because it is the longest match it keeps the S rule from firing on words that end in ss. You can see the effect when the algorithm is tested. Look at the word list [ridiculousness, caress].

    Case 1. Rule SS->SS in the algorithm.

    Stemming:

    caress (Step 1a)-> caress OK
    ridiculousness (Step 1a)-> ridiculousness (Step 2)-> ridiculous (Step 4)-> ridicul OK
    Success rate: 100%
    

    Case 2. Rule SS->SS not in the algorithm.

    Stemming:

    caress (Step 1a)-> cares WRONG
    ridiculousness (Step 1a)-> ridiculousnes WRONG (the Step 2 rule OUSNESS -> OUS no longer matches)
    Success rate: 0%
    

    So the rule changes nothing by itself. Its purpose is to protect words ending in ss from the rule S -> (nothing).
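
The longest-match behaviour of Step 1a can be sketched in Python. This is a minimal sketch, not a full Porter stemmer: it covers only Step 1a, using the four rules listed in Porter's paper (SSES -> SS, IES -> I, SS -> SS, S -> nothing). The names `STEP_1A_RULES`, `step_1a` and `RULES_WITHOUT_IDENTITY` are made up for illustration, and the variant without the identity rule is hypothetical, included only for comparison.

```python
# Step 1a suffix rules from Porter's paper, as (suffix, replacement) pairs.
STEP_1A_RULES = [("sses", "ss"), ("ies", "i"), ("ss", "ss"), ("s", "")]


def step_1a(word, rules=STEP_1A_RULES):
    """Apply the single rule whose suffix is the longest match for `word`."""
    matching = [(suffix, repl) for suffix, repl in rules if word.endswith(suffix)]
    if not matching:
        return word  # no rule fires; the word passes through unchanged
    suffix, repl = max(matching, key=lambda rule: len(rule[0]))
    return word[: len(word) - len(suffix)] + repl


# Hypothetical variant with the identity rule removed, for comparison only.
RULES_WITHOUT_IDENTITY = [rule for rule in STEP_1A_RULES if rule != ("ss", "ss")]

for word in ["caress", "ridiculousness", "caresses", "ponies", "cats"]:
    with_rule = step_1a(word)
    without_rule = step_1a(word, RULES_WITHOUT_IDENTITY)
    print(f"{word}: with SS->SS = {with_rule}, without = {without_rule}")
```

In this sketch, only words ending in ss behave differently between the two rule sets: with the identity rule they pass through Step 1a unchanged, and without it the shorter S rule matches and removes the final s.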