Search code examples
nlpstemmingporter-stemmer

Is it possible to get a natural word after it has been stemmed?


I have a word play which after stemming has become plai. Now I want to get play again. Is it possible? I have used Porter's Stemmer.


Solution

  • Stemmer is able to process artificial non-existing words. Would you like them to be returned as elements of a set of all possible words? How do you know that the word doesn't exist and shouldn't be returned?

    As an option: find a dictionary of all words and their forms. Find a stem for every of them. Save this projection as a map: ( stem, list of all word forms ). So you'll be able to get the list of all word forms for a given stem.

    UPD: If you need all possible words including non-existing then I can offer such an algorithm (it's not checked, just a suggestion):

    Porter stemming algorithm. We need a reversed version.

    If the rule in straight algorithm has a form (m>1) E -> (delete last E) then the reversed rule would be "fork with E" which means we need to try alternative ways. E.g., in straight algorithm probate -> probat, in reversed we have two alternatives: probat -> { probat, probate }. Each of these alternatives should be separately processed further. Note that this is a set of alternatives, so we will process only distinct words. Such a rule would have the following form: A -> { , B, C }, which means "replace ending A in three alternative ways: leave as-is, with B and with C".

    Step 5b: (m>1) *L -> { , +L } // Add L if there's L at the end.
    Step 5a: (m>1) -> { , +E }
             (m=1 and not *o) -> { , +E } // *o is a special condition, it's not *O.
    Step 4: (m>1) *S or *T -> { , +ION }
            (m>1) -> { , +AL, +ANCE, +ENCE, ..., +IVE, +IZE }
    Step 3: (m>0) *AL -> { , +IZE }
            (m>0) *IC -> { , +ATE, +ITI, +AL }
            (m>0) -> { , +ATIVE, +FUL, +NESS }
    Step 2: (m>0) *ATE -> { , ATIONAL } // Replace ATE.
            (m>0) *TION -> { , +AL } // Add AL at the end.
            (m>0) *ENCE -> { , ENCI } // Replace ENCE.
            ...
            (m>0) *BLE -> { , BILITI } // Replace BLE.
    Step 1c: (*v*) *I -> { , Y } // Replace I.
    Step 1b: (m=1 and *oE) -> { , +D, delete last E and add ING } // *o is a special condition.
             (*v*c and not (*L or *S or *Z)) -> { , add last consonant +ED, add last consonant + ING }
             *IZE -> { , IZING, +D }
             (*v*BLE) -> { , +D, delete last E and add ING }
             *ATE -> { , ATING, +D }
             (*v*) -> { , +ED, +ING }
             (m>0) *EE -> { , +D }
    Step 1a: *I -> { , +ES }
             *SS -> { , +ES }
             not *S -> { , +S }
    

    The straight algorithm had to choose first longest rule. The reversed algorithm should use all the rules.

    Example (straight):

    Input: PLAYING
    Step 1a doesn't match.
    PLAYING -> PLAY (Step 1b)
    PLAY -> PLAI (Step 1c)
    m=0, so the steps 2-5 don't match.
    Result: PLAI
    

    Reversed:

    Input: PLAI
    m=0, so the steps 2-5 are skipped
    Step 1c:
    PLAI -> { PLAI, PLAY }
    Step 1b:
    PLAI -> { PLAI, PLAIED, PLAIING }
    PLAY -> { PLAY, PLAYED, PLAYING }
    Resulting set: { PLAI, PLAIED, PLAIING, PLAY, PLAYED, PLAYING }
    Step 1a:
    PLAI -> { PLAI, PLAIS, PLAIES }
    PLAIED -> { PLAIED, PLAIEDS }
    PLAIING -> { PLAIING, PLAIINGS }
    PLAY -> { PLAY, PLAYS }
    PLAYED -> { PLAYED, PLAYEDS }
    PLAYING -> { PLAYING, PLAYINGS }
    Resulting set: { PLAI, PLAIS, PLAIES, PLAIED, PLAIEDS, PLAIING, PLAIINGS, PLAY, PLAYS, PLAYED, PLAYEDS, PLAYING, PLAYINGS }
    

    I've checked all those words at Michael Tontchev's link. The result for every of them is "plai" (note that the site doesn't accept upper-case input).