python nlp grammar nltk

Combining a Tokenizer into a Grammar and Parser with NLTK


I am making my way through the NLTK book and I can't seem to do something that would appear to be a natural first step for building a decent grammar.

My goal is to build a grammar for a particular text corpus.

(Initial question: Should I even try to start a grammar from scratch or should I start with a predefined grammar? If I should start with another grammar, which is a good one to start with for English?)

Suppose I have the following simple grammar:

import nltk

simple_grammar = nltk.CFG.fromstring("""
S -> NP VP
PP -> P NP
NP -> Det N | Det N PP
VP -> V NP | VP PP
Det -> 'a' | 'A'
N -> 'car' | 'door'
V -> 'has'
P -> 'in' | 'for'
""")

This grammar can parse a very simple sentence, such as:

parser = nltk.ChartParser(simple_grammar)
# parse() expects a list of tokens, not a raw string
trees = list(parser.parse("A car has a door".split()))

Now I want to extend this grammar to handle sentences with other nouns and verbs. How do I add those nouns and verbs to my grammar without manually defining them in the grammar?

For example, suppose I want to be able to parse the sentence "A car has wheels". I know that the supplied POS taggers can figure out which words are verbs, nouns, etc. How can I use the output of the tokenizer and tagger to tell the grammar that "wheels" is a noun?


Solution

  • You could run a POS tagger over your text and then adapt your grammar to work on POS tags instead of words.

    > text = nltk.word_tokenize("A car has a door")
    ['A', 'car', 'has', 'a', 'door']
    
    > tagged_text = nltk.pos_tag(text)
    [('A', 'DT'), ('car', 'NN'), ('has', 'VBZ'), ('a', 'DT'), ('door', 'NN')]
    
    > pos_tags = [pos for (token, pos) in tagged_text]
    ['DT', 'NN', 'VBZ', 'DT', 'NN']
    
    > simple_grammar = nltk.CFG.fromstring("""
      S -> NP VP
      PP -> P NP
      NP -> Det N | Det N PP
      VP -> V NP | VP PP
      Det -> 'DT'
      N -> 'NN'
      V -> 'VBZ'
      P -> 'IN'
      """)
    
    > parser = nltk.ChartParser(simple_grammar)
    > trees = list(parser.parse(pos_tags))  # parse() returns an iterator of trees
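
Parsing over POS tags leaves the tags, not the words, at the leaves of the tree, and the sentence from the question ("A car has wheels") has a plural noun with no determiner. The sketch below is one possible way to handle both, not a definitive implementation. It makes several assumptions: the `NNS` alternative and the bare-noun `NP -> N` rule are additions to the grammar, `attach_words` is a hypothetical helper, and the tagged sentence is hard-coded so the example runs without downloading tagger models (in practice it would come from `nltk.pos_tag(nltk.word_tokenize(...))`). Prepositions use `IN`, the Penn Treebank tag.

```python
import nltk

# Grammar over Penn Treebank POS tags rather than words.
# 'NNS' and the bare-noun rule "NP -> N" are assumptions added so that
# a plural noun without a determiner ("wheels") can be parsed.
tag_grammar = nltk.CFG.fromstring("""
S -> NP VP
PP -> P NP
NP -> Det N | Det N PP | N
VP -> V NP | VP PP
Det -> 'DT'
N -> 'NN' | 'NNS'
V -> 'VBZ'
P -> 'IN'
""")

# Hard-coded (word, tag) pairs so the sketch needs no tagger model;
# normally this list would be produced by a POS tagger.
tagged = [('A', 'DT'), ('car', 'NN'), ('has', 'VBZ'), ('wheels', 'NNS')]

words = [word for word, tag in tagged]
tags = [tag for word, tag in tagged]

parser = nltk.ChartParser(tag_grammar)
trees = list(parser.parse(tags))  # parse() yields zero or more trees


def attach_words(tree, words):
    """Return a copy of the tree with each POS-tag leaf replaced by the
    word at the same position in the sentence (hypothetical helper)."""
    tree = tree.copy(deep=True)
    for i, position in enumerate(tree.treepositions('leaves')):
        tree[position] = words[i]
    return tree


word_trees = [attach_words(t, words) for t in trees]
for t in word_trees:
    print(t)
```

The grammar itself never mentions "wheels"; the tagger's output decides which category each word falls into, so new nouns and verbs need no new lexical rules as long as their tags are covered.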