I want to stem early modern English text:
sb.stem("loveth")
>>> "lov"
Apparently, all I need to do is a small tweak to the Snowball Stemmer:
And to put the endings into the English stemmer, the list
ed edly ing ingly
of Step 1b should be extended to
ed edly ing ingly est eth
As far as the Snowball scripts are concerned, the endings 'est' 'eth' must be added against ending 'ing'.
Great, so I just have to change the variables. Perhaps add a special rule to deal with "thee"/"thou"/"you" and "shalt"/"shall". The NLTK documentation show the variables as:
class nltk.stem.snowball.EnglishStemmer(ignore_stopwords=False)
Bases: nltk.stem.snowball._StandardStemmer
The English Snowball stemmer.
Variables:
__vowels – The English vowels.
__double_consonants – The English double consonants.
__li_ending – Letters that may directly appear before a word final ‘li’.
__step0_suffixes – Suffixes to be deleted in step 0 of the algorithm.
__step1a_suffixes – Suffixes to be deleted in step 1a of the algorithm.
__step1b_suffixes – Suffixes to be deleted in step 1b of the algorithm. (Here we go)
__step2_suffixes – Suffixes to be deleted in step 2 of the algorithm.
__step3_suffixes – Suffixes to be deleted in step 3 of the algorithm.
__step4_suffixes – Suffixes to be deleted in step 4 of the algorithm.
__step5_suffixes – Suffixes to be deleted in step 5 of the algorithm.
__special_words – A dictionary containing words which have to be stemmed specially. (I can stick my "thee"/"thou" and "shalt" issues here)
Now, dumb question. How do I change the variable? Everywhere I've looked for the variables, I keep getting "object has no attribute"...
Try:
>>> from nltk.stem import snowball
>>> stemmer = snowball.EnglishStemmer()
>>> stemmer.stem('thee')
u'thee'
>>> dir(stemmer)
['_EnglishStemmer__double_consonants', '_EnglishStemmer__li_ending', '_EnglishStemmer__special_words', '_EnglishStemmer__step0_suffixes', '_EnglishStemmer__step1a_suffixes', '_EnglishStemmer__step1b_suffixes', '_EnglishStemmer__step2_suffixes', '_EnglishStemmer__step3_suffixes', '_EnglishStemmer__step4_suffixes', '_EnglishStemmer__step5_suffixes', '_EnglishStemmer__vowels', '__class__', '__delattr__', '__dict__', '__doc__', '__format__', '__getattribute__', '__hash__', '__init__', '__module__', '__new__', '__reduce__', '__reduce_ex__', '__repr__', '__setattr__', '__sizeof__', '__str__', '__subclasshook__', '__unicode__', '__weakref__', '_r1r2_standard', '_rv_standard', 'stem', 'stopwords', 'unicode_repr']
>>> stemmer._EnglishStemmer__special_words
{u'exceeds': u'exceed', u'inning': u'inning', u'exceed': u'exceed', u'exceeding': u'exceed', u'succeeds': u'succeed', u'succeeded': u'succeed', u'skis': u'ski', u'gently': u'gentl', u'singly': u'singl', u'cannings': u'canning', u'early': u'earli', u'earring': u'earring', u'bias': u'bias', u'tying': u'tie', u'exceeded': u'exceed', u'news': u'news', u'herring': u'herring', u'proceeds': u'proceed', u'succeeding': u'succeed', u'innings': u'inning', u'proceeded': u'proceed', u'proceed': u'proceed', u'dying': u'die', u'outing': u'outing', u'sky': u'sky', u'andes': u'andes', u'idly': u'idl', u'outings': u'outing', u'ugly': u'ugli', u'only': u'onli', u'proceeding': u'proceed', u'lying': u'lie', u'howe': u'howe', u'atlas': u'atlas', u'earrings': u'earring', u'cosmos': u'cosmos', u'canning': u'canning', u'succeed': u'succeed', u'herrings': u'herring', u'skies': u'sky'}
>>> stemmer._EnglishStemmer__special_words['thee'] = 'thou'
>>> stemmer.stem('thee')
'thou'
And:
>>> stemmer._EnglishStemmer__step0_suffixes
(u"'s'", u"'s", u"'")
>>> stemmer._EnglishStemmer__step1a_suffixes
(u'sses', u'ied', u'ies', u'us', u'ss', u's')
>>> stemmer._EnglishStemmer__step1b_suffixes
(u'eedly', u'ingly', u'edly', u'eed', u'ing', u'ed')
>>> stemmer._EnglishStemmer__step2_suffixes
(u'ization', u'ational', u'fulness', u'ousness', u'iveness', u'tional', u'biliti', u'lessli', u'entli', u'ation', u'alism', u'aliti', u'ousli', u'iviti', u'fulli', u'enci', u'anci', u'abli', u'izer', u'ator', u'alli', u'bli', u'ogi', u'li')
>>> stemmer._EnglishStemmer__step3_suffixes
(u'ational', u'tional', u'alize', u'icate', u'iciti', u'ative', u'ical', u'ness', u'ful')
>>> stemmer._EnglishStemmer__step4_suffixes
(u'ement', u'ance', u'ence', u'able', u'ible', u'ment', u'ant', u'ent', u'ism', u'ate', u'iti', u'ous', u'ive', u'ize', u'ion', u'al', u'er', u'ic')
>>> stemmer._EnglishStemmer__step5_suffixes
(u'e', u'l')
Note that the step suffixes are tuples and are immutable so you can't append or add to them like the special words, you would have to "copy" and cast to list and append to it, then overwrite it, e.g.:
>>> from nltk.stem import snowball
>>> stemmer = snowball.EnglishStemmer()
>>> stemmer._EnglishStemmer__step1b_suffixes
[u'eedly', u'ingly', u'edly', u'eed', u'ing', u'ed', 'eth']
>>> step1b = stemmer._EnglishStemmer__step1b_suffixes
>>> stemmer._EnglishStemmer__step1b_suffixes = list(step1b) + ['eth']
>>> stemmer.stem('loveth')
u'love'