Using the methods defined in the NLTK book, I want to create a parse tree of a sentence that has already been POS tagged. From what I understand of the chapter linked above, any words you want to be able to recognize need to be in the grammar. This seems ridiculous, seeing as there's a built-in POS tagger that would make hand-writing the parts of speech for each word completely redundant. Am I missing some functionality of the parsing methods that allows for this?
There are two different kinds of technology involved here. The chapter you link to is about hand-written context-free grammars, which typically have a few dozen rules and can handle a tiny subset of English (or whatever other language you're covering). While it is possible to build a large-coverage system out of a very large number of such rules (plus other technologies), the CFG implementation in the NLTK is only intended for teaching or demonstration purposes. Put differently, it's a toy: don't even think about using it for general-purpose parsing.
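To make the contrast concrete, here's the kind of toy grammar the chapter has in mind; note that every word must be listed explicitly as a terminal (the grammar and sentence are a made-up example of my own):

```python
import nltk
from nltk import CFG

# Book-style toy grammar: every word you want to parse
# must appear in the grammar as a terminal.
grammar = CFG.fromstring("""
S -> NP VP
NP -> Det N
VP -> V NP
Det -> 'the' | 'a'
N -> 'dog' | 'cat'
V -> 'chases'
""")

parser = nltk.ChartParser(grammar)
for tree in parser.parse("the dog chases a cat".split()):
    print(tree)
```

Any sentence containing a word not listed above makes the parser raise an error, which is exactly the limitation you're running into.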
For parsing real text, there are probabilistic parsers like the Stanford parser (for which the NLTK has an interface in `nltk.parse.stanford`). Such parsers are generally trained on large treebanks; they can handle unknown words, and as you would expect, they either take POS-tagged text as input or do their own POS tagging.
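For instance, a sketch with `StanfordParser` might look like the following. The jar paths are placeholders you'd point at your own download of the Stanford parser (exact filenames depend on the version):

```python
from nltk.parse.stanford import StanfordParser

# Placeholder paths: adjust to wherever you unpacked the Stanford parser.
parser = StanfordParser(
    path_to_jar='/path/to/stanford-parser.jar',
    path_to_models_jar='/path/to/stanford-parser-models.jar')

# It can tokenize and tag raw text on its own...
tree = next(parser.raw_parse("the dog chases a cat"))
tree.pretty_print()

# ...or accept a sentence you've already POS tagged:
tagged = [('the', 'DT'), ('dog', 'NN'), ('chases', 'VBZ'),
          ('a', 'DT'), ('cat', 'NN')]
tree = next(parser.tagged_parse(tagged))
```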
All this said, it's not hard to tweak the NLTK's CFG machinery to handle unknown words, if you have reason to do that: write your grammar over POS tags rather than over words (e.g., `NP -> 'DT' 'NN'`, so that the POS tags are the terminals); then extract the POS tags from your tagged sentence, build a parse tree over them, and put the words back in. (This won't be enough if your CFG contains rules that mix terminals and non-terminals, like `'give' NP 'to' NP`.)
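Here's a minimal sketch of that recipe; both the grammar and the pre-tagged sentence are toy examples of my own:

```python
import nltk
from nltk import CFG

# Grammar over POS tags: the tags themselves are the terminals.
grammar = CFG.fromstring("""
S -> NP VP
NP -> 'DT' 'NN'
VP -> 'VBZ' NP
""")
parser = nltk.ChartParser(grammar)

# A sentence that has already been POS tagged (e.g. by nltk.pos_tag).
tagged = [('the', 'DT'), ('dog', 'NN'), ('chases', 'VBZ'),
          ('a', 'DT'), ('cat', 'NN')]
words, tags = zip(*tagged)

# Parse the tag sequence, then put the words back in by
# replacing each POS-tag leaf with the corresponding word.
for tree in parser.parse(tags):
    for leaf_pos, word in zip(tree.treepositions('leaves'), words):
        tree[leaf_pos] = word
    print(tree)
```

Since the leaves come back in left-to-right order, zipping them against the original words restores the sentence; if you'd rather keep the tags, replace each leaf with a small `(tag word)` subtree instead.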