python nlp nltk tagging pos-tagger

Why is one set of tagged sentences not parsing?


So I'm supposed to chunk some tagged sentences from the WSJ corpus using my very simple parser. When I POS-tag the sentences myself it works, but getting the tagged sentences the way the assignment specifies does not.

My assignment told me to use sentences 200–220 of the tagged WSJ corpus, nltk.corpus.treebank.tagged_sents(). However, my parser is giving me an error.

My code, which works (manually tagging the sentences):

tbss = concat(treebank.sents()[200:220])
tag1 = nltk.pos_tag(tbss)
print(cp.parse(tag1))

Using their code, which doesn't work:

tag2 = nltk.corpus.treebank.tagged_sents()[200:220]
print(cp.parse(tag2))
>>> ValueError: chunk structures must contain tagged tokens or trees

Why exactly is the second one giving that error? I printed both tag1 and tag2 and they look almost identical, so why does one parse and not the other? Am I doing something wrong?


Solution

  • You get an error because you pass cp.parse() a list of sentences (each one itself a list of (word, tag) tuples), not a flat list of tagged tokens. You don't show where concat comes from, but clearly (as @lenz commented) it concatenates the sentences into a single list of words, which is why tag1 is flat. To do the same in the second case, you'd need cp.parse(concat(tag2)).

    However, this is incorrect unless you have a very unusual grammar. Parsers work on one sentence at a time, so you should keep your sentences separate, not concatenate them. Either iterate over your list of sentences and parse each one, or parse all the tagged sentences at once with cp.parse_sents(tag2). The same applies to the self-tagged treebank sentences, which should have been tagged and parsed like this:

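    import nltk
    from nltk.corpus import treebank

    # cp is the chunk parser from the question (not shown there)
    tbss = treebank.sents()[200:220]
    tag1 = nltk.pos_tag_sents(tbss)   # tag each sentence separately
    parsed1 = cp.parse_sents(tag1)    # one chunk tree per sentence
    for sent in parsed1:
        print(sent)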
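
To see the difference without downloading the corpus, here is a minimal sketch that reproduces the error and the fix. The one-rule NP grammar and the two hand-made sentences are assumptions for illustration only; they stand in for the question's parser and for treebank.tagged_sents()[200:220], which has the same nested shape.

```python
import nltk

# Hypothetical stand-in for the question's parser: a one-rule NP chunker.
cp = nltk.RegexpParser("NP: {<DT>?<JJ>*<NN.*>+}")

# Hand-made data shaped like treebank.tagged_sents()[200:220]:
# a list of sentences, each a list of (word, tag) tuples.
tagged_sents = [
    [("The", "DT"), ("market", "NN"), ("fell", "VBD"), (".", ".")],
    [("Investors", "NNS"), ("sold", "VBD"), ("shares", "NNS"), (".", ".")],
]

# Passing the whole list to parse() fails, because each element is a
# list (a sentence) rather than a (word, tag) tuple or a Tree.
try:
    cp.parse(tagged_sents)
    failed = False
except ValueError as err:
    failed = True
    print("ValueError:", err)

# Parsing one sentence at a time yields one chunk tree per sentence.
trees = [cp.parse(sent) for sent in tagged_sents]
for tree in trees:
    print(tree)
```

The list comprehension is the "iterate and parse each one" option from the answer; cp.parse_sents(tagged_sents) is the other way to get one tree per sentence.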