Right now I'm using PLY to implement a parser for a very small subset of English. For instance, I have a list of names for nouns and small sets of intransitive verbs, transitive verbs, and dative verbs, and I can make sentences out of different combinations of these. However, in my lexer, I am having an issue with efficiently defining the elements belonging to each token. For instance, for the nouns, if the set of names I wish to include is [Harry, Ron, Hermione, Draco, Snape], the only way I could find to assign these values to the token "N" for noun is
tokens = ['N', 'Vi', 'Vt', 'Vd', 'Conj']
t_N = r'Harry|Ron|Hermione|Draco|Snape'
But this seems like a very inefficient way of assigning these, and it leaves no room for expansion. For instance, if I want to add a list of names read from a text file, there is no clean way to do so. Is there a way to define a list as the specification of a token in PLY?
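The closest workaround I have found is to build the regex string programmatically before handing it to PLY. A sketch (the file name `names.txt` is hypothetical, one name per line):

```python
import re

# Hypothetical workaround: construct the token regex from a word list at import time.
names = ["Harry", "Ron", "Hermione", "Draco", "Snape"]

# The list could equally come from a file, e.g.:
# with open("names.txt") as f:
#     names = [line.strip() for line in f if line.strip()]

# Escape each name and join with '|' to form the alternation pattern.
t_N = "|".join(re.escape(name) for name in names)
```

But this still feels like it is fighting the tool rather than using it.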
With PLY, the usual solution is to use a token function rather than a constant. The function's associated regex matches any word (i.e. something like `[a-zA-Z]+`), and the body of the function then looks the word up in a dictionary whose keys are the known words and whose values are their lexical categories.
There is an example of the dictionary approach at the end of the manual's section on Specification of Tokens.
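A minimal sketch of that approach, adapted to the token names in the question (the lexicon entries like `sleeps` and `sees` are invented for illustration, and `WORD` is an assumed fallback category for unknown words):

```python
import ply.lex as lex

# 'WORD' is the fallback type for words not found in the lexicon.
tokens = ['N', 'Vi', 'Vt', 'Vd', 'Conj', 'WORD']

# Dictionary mapping known words to their lexical categories.
# Adding vocabulary (e.g. from a file) only means updating this dict.
lexicon = {
    'Harry': 'N', 'Ron': 'N', 'Hermione': 'N',
    'Draco': 'N', 'Snape': 'N',
    'sleeps': 'Vi',   # intransitive
    'sees': 'Vt',     # transitive
    'gives': 'Vd',    # dative
    'and': 'Conj',
}

def t_WORD(t):
    r'[a-zA-Z]+'
    # Reassign the token type based on the dictionary lookup.
    t.type = lexicon.get(t.value, 'WORD')
    return t

t_ignore = ' \t'

def t_error(t):
    print(f"Illegal character {t.value[0]!r}")
    t.lexer.skip(1)

lexer = lex.lex()
```

Because the regex is now a single generic word pattern, growing the vocabulary is just a dictionary update, which is cheap and easy to populate from an external source.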
That will work fine for a simple small subset, but you're eventually going to run into the problem that many English words belong to more than one grammatical category (e.g., words that can be either nouns or verbs).