Python NLTK exercise: chapter 5

Hi guys, I'm starting to study NLTK following the official book from the NLTK team.

I'm in chapter 5 ("Tagging") and I can't solve one of the exercises on page 186 of the PDF version:

Given the list of past participles specified by cfd2['VN'].keys(), try to collect a list of all the word-tag pairs that immediately precede items in that list.

I tried this way:

wsj = nltk.corpus.treebank.tagged_words(simplify_tags=True)

[wsj[wsj.index((word,tag))-1:wsj.index((word,tag))+1] for (word,tag) in wsj if word in cfd2['VN'].keys()]

but it gives me this error:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/local/lib/python2.7/dist-packages/nltk/corpus/reader/util.py", line 401, in iterate_from
    for tok in piece.iterate_from(max(0, start_tok-offset)):
  File "/usr/local/lib/python2.7/dist-packages/nltk/corpus/reader/util.py", line 295, in iterate_from
    self._stream.seek(filepos)
AttributeError: 'NoneType' object has no attribute 'seek'

I think I'm doing something wrong in accessing the wsj structure, but I can't figure out what is wrong!

Can you help me?

Thanks in advance!

Solution

wsj is of type nltk.corpus.reader.util.ConcatenatedCorpusView, which behaves like a list (this is why you can call methods like index() on it). Behind the scenes, however, NLTK never reads the whole list into memory; it reads only the parts it needs from a file object. It seems that if you iterate over a CorpusView object and call index() on it at the same time (which requires iterating again), the view's file object ends up being None, hence the AttributeError in the traceback.
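Apart from the error, index() has a second drawback that is easy to miss: it always returns the position of the first match. If the same (word, tag) pair occurs more than once, every occurrence reports the same neighbours. A small sketch with an ordinary Python list (the data is made up for illustration, not taken from the treebank):

```python
# Toy tagged words (made-up data, not the treebank corpus).
tagged = [("was", "V"), ("seen", "VN"), ("had", "V"), ("seen", "VN")]

# index() scans from the start and stops at the first match,
# so both occurrences of ("seen", "VN") map to position 1.
positions = [tagged.index(pair) for pair in tagged if pair[1] == "VN"]
print(positions)  # [1, 1]
```

So even where index() does not fail, looking tokens up by value tends to be both slow and misleading when the corpus contains repeated pairs.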

Iterating by position instead of calling index() works, though it is less elegant than a list comprehension:

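  vn_words = set(cfd2['VN'].keys())  # build the lookup set once
  for i in range(1, len(wsj)):  # start at 1: the first token has no predecessor
    if wsj[i][0] in vn_words:
      print(wsj[i-1:i+1])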
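If you still prefer a list comprehension, one possible approach is to pair each token with its successor using zip(), which avoids index() altogether. The sketch below is only an illustration: preceding_pairs is a hypothetical helper name, and the sample data stands in for wsj and cfd2['VN'].keys().

```python
def preceding_pairs(tagged, targets):
    """Return the (word, tag) pairs that immediately precede a word in targets."""
    tagged = list(tagged)    # read the sequence once, up front
    targets = set(targets)   # constant-time membership tests
    # zip pairs each token with its successor: (t0, t1), (t1, t2), ...
    return [prev for prev, (word, tag) in zip(tagged, tagged[1:])
            if word in targets]

# Toy data standing in for wsj and cfd2['VN'].keys()
sample = [("He", "PRO"), ("was", "V"), ("seen", "VN"), ("and", "CNJ"),
          ("had", "V"), ("been", "VN"), ("seen", "VN")]
print(preceding_pairs(sample, ["seen", "been"]))
# [('was', 'V'), ('had', 'V'), ('been', 'VN')]
```

With the real corpus this would presumably be called as preceding_pairs(wsj, cfd2['VN'].keys()). Copying wsj into a list gives up the lazy loading of the corpus view, which should be acceptable for a corpus of this size but is a trade-off to keep in mind.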