python nltk corpus tagged-corpus

Editing the NLTK Corpus


In addition to the corpus that comes with NLTK, I want to train the tagger on my own corpus, which follows the same part-of-speech conventions. How can I find the corpus NLTK is using, and how can I add my own corpus (in addition to it, not as a replacement)?

EDIT: Here is the code that I am currently using:

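import nltk

inpy = raw_input("$")  # Python 2; use input("$") on Python 3
text = nltk.word_tokenize(inpy)  # split the input into tokens
d = nltk.pos_tag(text)  # tag each token with NLTK's default tagger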

Solution

  • NLTK ships with a large number of corpora, so it would help if you said which one you want to augment. The main English POS-tagged corpus in NLTK is the Brown corpus. See http://www.nltk.org/book/ch05.html for tagging and training taggers, http://en.wikipedia.org/wiki/Brown_Corpus for background on the corpus, and http://www.nltk.org/nltk_data/ for the list of downloadable data.
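
If the goal is to train on Brown plus your own tagged sentences, one option is to concatenate the two lists of tagged sentences and train a tagger on the result. The following is only a minimal sketch under several assumptions: it trains a separate UnigramTagger rather than changing the pre-trained model behind nltk.pos_tag, the helper train_combined_tagger and the sample sentences are hypothetical, and a tiny stand-in list replaces the Brown corpus so the code runs without downloading data.

```python
import nltk
from nltk.tag import DefaultTagger, UnigramTagger

# Directories NLTK searches for installed corpora and models.
print(nltk.data.path)


def train_combined_tagger(base_sents, extra_sents):
    """Train a unigram tagger on base_sents plus extra_sents (hypothetical helper)."""
    combined = list(base_sents) + list(extra_sents)
    # Words never seen in training fall back to the tag "NN".
    backoff = DefaultTagger("NN")
    return UnigramTagger(combined, backoff=backoff)


# Stand-in for the Brown corpus (assumption, to avoid a download).
# With the data installed you could use instead:
#   from nltk.corpus import brown
#   base_sents = brown.tagged_sents()
base_sents = [
    [("The", "AT"), ("dog", "NN"), ("barks", "VBZ")],
]

# Hypothetical user corpus in the same (word, tag) format, using Brown-style tags.
my_sents = [
    [("NLTK", "NP"), ("tags", "VBZ"), ("words", "NNS")],
]

tagger = train_combined_tagger(base_sents, my_sents)
print(tagger.tag(["The", "dog", "tags", "words"]))
```

A unigram tagger is chosen here only because it is simple to train; the same combined list could be passed to other trainable taggers in nltk.tag. Your own sentences have to use the same tagset as the base corpus, otherwise the combined training data will be inconsistent.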