I am building a lemmatizer in python. As I need it to run in realtime/process fairly large amount of data the processing speed is of the essence. Data: I have all possible suffixes that are linked to all wordtypes that they can be combined with. Additionally I have lemmaforms that are linked to both their wordtype(s) and lemma(s). The program takes a word as input and outputs its lemma. word = lemmafrom + suffix
For example (Note: although the example is given in English I am not building a lemmatizer for English):
word: forbidding
lemmaform: forbidd
suffix: ing
lemma: forbid
My solution:
I have converted the data to (nested) dicts:
suffixdict : {suffix1:[type1,type2, ... , type(n)], suffix2:[type1,type2, ... ,
type(n)]}
lemmaformdict : {lemmaform:{type1:lemma}}
1) Find all possible suffixes and word types that they are linked to. If the longest possible suffix is 3 characters long, the program tries to match 'ing', 'ng', 'n' to the keys in suffixdict. If the key exists it returns a value (a set of wordtypes).
2) For each matching suffix search the lemmaform from the dict. If lemmaform exists it returns the wordtypes.
3) Finally, the program tries to intersect the wordtypes produced in steps 1) ans 2) and if the intersection is sucessful it returns the lemma of the word.
My question: could there be a better solution to my problem from the prespective of speed? (Disregarding the option to keep frequent words and lemmas in the dictionary) Help much appriciated.
This would be a wonderful application for finite state transducers. Why? Because they allow you to do string rewriting efficiently (in time linear to the size of the input). Consider the following s[ia]mple transducer:
It takes a string as input and checks whether there exists a path from the initial state (here, 0) to a final state (10, 12 and 17, respectively) given the sequence of input characters. If it reaches a final state, it produces the appropriate output, e.g. (forbidd, ing) if the input was "forbidding".
I don't know whether you have any background on finite state automata, though. If not, give them a try - it will be worth the effort. :) Tries are a special kind of finite state automaton (the sample transducer above is a trie), so they might be a good start.