
How to Stem Shakespeare/KJV Using nltk.stem.snowball


I want to stem early modern English text:

sb.stem("loveth")
>>> "lov"

Apparently, all I need to do is a small tweak to the Snowball Stemmer:

And to put the endings into the English stemmer, the list

ed edly ing ingly

of Step 1b should be extended to

ed edly ing ingly est eth

As far as the Snowball scripts are concerned, the endings 'est' 'eth' must be added against ending 'ing'.

Great, so I just have to change the variables, and perhaps add a special rule to deal with "thee"/"thou"/"you" and "shalt"/"shall". The NLTK documentation shows the variables as:

class nltk.stem.snowball.EnglishStemmer(ignore_stopwords=False)

Bases: nltk.stem.snowball._StandardStemmer

The English Snowball stemmer.

Variables:

__vowels – The English vowels.

__double_consonants – The English double consonants.

__li_ending – Letters that may directly appear before a word final ‘li’.

__step0_suffixes – Suffixes to be deleted in step 0 of the algorithm.

__step1a_suffixes – Suffixes to be deleted in step 1a of the algorithm.

__step1b_suffixes – Suffixes to be deleted in step 1b of the algorithm. (Here we go)

__step2_suffixes – Suffixes to be deleted in step 2 of the algorithm.

__step3_suffixes – Suffixes to be deleted in step 3 of the algorithm.

__step4_suffixes – Suffixes to be deleted in step 4 of the algorithm.

__step5_suffixes – Suffixes to be deleted in step 5 of the algorithm.

__special_words – A dictionary containing words which have to be stemmed specially. (I can stick my "thee"/"thou" and "shalt" issues here)

Now, dumb question. How do I change the variable? Everywhere I've looked for the variables, I keep getting "object has no attribute"...
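
For context, an "object has no attribute" error is what Python's name mangling produces for attributes like these: a name of the form `__x` written inside a class body is stored as `_ClassName__x`, so looking up `__x` from outside the class fails. A minimal sketch of the effect, using a hypothetical `ToyStemmer` class that is not part of NLTK:

```python
class ToyStemmer:
    # Hypothetical stand-in for a class with double-underscore attributes.
    __suffixes = ("ing", "ed")

    def suffixes(self):
        # Inside the class body, __suffixes is rewritten to _ToyStemmer__suffixes.
        return self.__suffixes


toy = ToyStemmer()

# The unmangled name does not exist on the object.
print(hasattr(toy, "__suffixes"))   # False

# The mangled name does, and dir() reveals it.
print(toy._ToyStemmer__suffixes)    # ('ing', 'ed')
mangled = [name for name in dir(toy) if name.endswith("__suffixes")]
print(mangled)                      # ['_ToyStemmer__suffixes']
```

By the same rule, the variables listed above would be reachable on an `EnglishStemmer` under names starting with `_EnglishStemmer__`.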


Solution

  • Try:

    >>> from nltk.stem import snowball
    >>> stemmer = snowball.EnglishStemmer()
    >>> stemmer.stem('thee')
    u'thee'
    >>> dir(stemmer)
    ['_EnglishStemmer__double_consonants', '_EnglishStemmer__li_ending', '_EnglishStemmer__special_words', '_EnglishStemmer__step0_suffixes', '_EnglishStemmer__step1a_suffixes', '_EnglishStemmer__step1b_suffixes', '_EnglishStemmer__step2_suffixes', '_EnglishStemmer__step3_suffixes', '_EnglishStemmer__step4_suffixes', '_EnglishStemmer__step5_suffixes', '_EnglishStemmer__vowels', '__class__', '__delattr__', '__dict__', '__doc__', '__format__', '__getattribute__', '__hash__', '__init__', '__module__', '__new__', '__reduce__', '__reduce_ex__', '__repr__', '__setattr__', '__sizeof__', '__str__', '__subclasshook__', '__unicode__', '__weakref__', '_r1r2_standard', '_rv_standard', 'stem', 'stopwords', 'unicode_repr']
    >>> stemmer._EnglishStemmer__special_words
    {u'exceeds': u'exceed', u'inning': u'inning', u'exceed': u'exceed', u'exceeding': u'exceed', u'succeeds': u'succeed', u'succeeded': u'succeed', u'skis': u'ski', u'gently': u'gentl', u'singly': u'singl', u'cannings': u'canning', u'early': u'earli', u'earring': u'earring', u'bias': u'bias', u'tying': u'tie', u'exceeded': u'exceed', u'news': u'news', u'herring': u'herring', u'proceeds': u'proceed', u'succeeding': u'succeed', u'innings': u'inning', u'proceeded': u'proceed', u'proceed': u'proceed', u'dying': u'die', u'outing': u'outing', u'sky': u'sky', u'andes': u'andes', u'idly': u'idl', u'outings': u'outing', u'ugly': u'ugli', u'only': u'onli', u'proceeding': u'proceed', u'lying': u'lie', u'howe': u'howe', u'atlas': u'atlas', u'earrings': u'earring', u'cosmos': u'cosmos', u'canning': u'canning', u'succeed': u'succeed', u'herrings': u'herring', u'skies': u'sky'}
    >>> stemmer._EnglishStemmer__special_words['thee'] = 'thou'
    >>> stemmer.stem('thee')
    'thou'
    

    And:

    >>> stemmer._EnglishStemmer__step0_suffixes
    (u"'s'", u"'s", u"'")
    >>> stemmer._EnglishStemmer__step1a_suffixes
    (u'sses', u'ied', u'ies', u'us', u'ss', u's')
    >>> stemmer._EnglishStemmer__step1b_suffixes
    (u'eedly', u'ingly', u'edly', u'eed', u'ing', u'ed')
    >>> stemmer._EnglishStemmer__step2_suffixes
    (u'ization', u'ational', u'fulness', u'ousness', u'iveness', u'tional', u'biliti', u'lessli', u'entli', u'ation', u'alism', u'aliti', u'ousli', u'iviti', u'fulli', u'enci', u'anci', u'abli', u'izer', u'ator', u'alli', u'bli', u'ogi', u'li')
    >>> stemmer._EnglishStemmer__step3_suffixes
    (u'ational', u'tional', u'alize', u'icate', u'iciti', u'ative', u'ical', u'ness', u'ful')
    >>> stemmer._EnglishStemmer__step4_suffixes
    (u'ement', u'ance', u'ence', u'able', u'ible', u'ment', u'ant', u'ent', u'ism', u'ate', u'iti', u'ous', u'ive', u'ize', u'ion', u'al', u'er', u'ic')
    >>> stemmer._EnglishStemmer__step5_suffixes
    (u'e', u'l')
    

    Note that the step suffixes are tuples, which are immutable, so you can't add to them in place the way you can with the special-words dictionary. Instead, copy the tuple into a list, append to it, and assign the result back to the attribute, e.g.:

    >>> from nltk.stem import snowball
    >>> stemmer = snowball.EnglishStemmer()
    >>> step1b = stemmer._EnglishStemmer__step1b_suffixes
    >>> stemmer._EnglishStemmer__step1b_suffixes = list(step1b) + ['eth']
    >>> stemmer._EnglishStemmer__step1b_suffixes
    [u'eedly', u'ingly', u'edly', u'eed', u'ing', u'ed', 'eth']
    >>> stemmer.stem('loveth')
    u'love'
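
The two tweaks above can be bundled into one helper. This is only a sketch: the function name `make_early_modern_stemmer` is made up for illustration, and it relies on NLTK's private, name-mangled attributes, which may change between NLTK versions. It assigns copies to the instance rather than editing the dictionary in place, on the assumption that the suffix tuples and the special-words dictionary are class-level attributes shared by every `EnglishStemmer`.

```python
from nltk.stem.snowball import EnglishStemmer


def make_early_modern_stemmer():
    """Return an EnglishStemmer whose overrides live on this instance only.

    Hypothetical helper, not part of NLTK.
    """
    stemmer = EnglishStemmer()

    # Assigning through the instance shadows the class-level tuple
    # instead of changing it for every other EnglishStemmer.
    base_suffixes = stemmer._EnglishStemmer__step1b_suffixes
    stemmer._EnglishStemmer__step1b_suffixes = tuple(base_suffixes) + ("eth", "est")

    # Copy the special-words dictionary before adding to it, so the
    # shared original stays untouched.
    base_special = stemmer._EnglishStemmer__special_words
    stemmer._EnglishStemmer__special_words = dict(base_special, thee="thou")

    return stemmer


early_modern = make_early_modern_stemmer()
print(early_modern.stem("loveth"))  # love
print(early_modern.stem("thee"))    # thou
```

One caveat with adding "est": step 1b strips a listed suffix from any word that has a vowel before it, so ordinary words ending in "est" (for example "forest") would be affected too. Entries in the special-words dictionary are one way to protect such words.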