I'm trying to create a tagged corpus in Malagasy (my mother tongue). I followed the instructions in the document Python Text Processing and Natural Language Processing and the page https://www.nltk.org/book/ch05.html . I have managed to create my own part-of-speech tagset based on the Universal Part-of-Speech Tagset and a small tagged corpus. This is my code:
import os, os.path
path = os.path.expanduser('D:/Mes documents/MY_POS_tagger/nltk_data')
if not os.path.exists(path):
    os.mkdir(path)
print("OS path done :%s"%os.path.exists(path))
import nltk.data
nltk.data.path.append('D:/Mes documents/MY_POS_tagger/nltk_data')
print("NLTK data path done:%s"%(path in nltk.data.path))
#read a POSfile
import nltk
from nltk.corpus.reader import TaggedCorpusReader
from nltk.tag import UnigramTagger
#there's only one file, malagasy.pos, which is where my tagged corpus lives
reader = TaggedCorpusReader('D:/Mes documents/MY_POS_tagger/nltk_data/corpora/cookbook', r'.*\.pos')
train_sents=reader.tagged_sents()
tagger=UnigramTagger(train_sents)
#dago.txt contains plain sentences without tags; I just wanted to test whether the tags I assigned in the .pos file would be applied
text = nltk.data.load('corpora/cookbook/dago.txt', format='text')
text_tokenized=nltk.word_tokenize(text)
print(tagger.tag(text_tokenized))
I get this result:
OS path done :True
NLTK data path done:True
[('Matory', u'VB'), ('ny', None), ('alika', u'NN')]
So I can see that it works, but I read in the document above that I have to train my tagger. Can someone suggest how I can do that? I read that I need to pickle a trained tagger and to train and combine N-gram taggers, but I don't understand what pickle means or does. I also don't know whether what I'm doing now is the correct way to create and use a tagged corpus with NLTK. Thank you.
I need to pickle a trained tagger and to train and combine N-gram taggers, but I don't understand what pickle means or does
As for this part of your question: pickle is a module in the Python standard library that lets you dump any Python object of your choosing as binary data on your hard drive and load it back later.
Info here: https://docs.python.org/3/library/pickle.html
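For example, here is a minimal sketch of saving and reloading a trained tagger with pickle, assuming the train_sents and tagger from your code above; the file name my_tagger.pickle is just an example:

import pickle
from nltk.tag import UnigramTagger

# train on the sentences read from your .pos file, as in your code
tagger = UnigramTagger(train_sents)

# dump the trained tagger to disk so you don't have to retrain it every time
with open('my_tagger.pickle', 'wb') as f:
    pickle.dump(tagger, f)

# later (e.g. in another script), load it back and use it directly
with open('my_tagger.pickle', 'rb') as f:
    tagger = pickle.load(f)

print(tagger.tag(['Matory', 'ny', 'alika']))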
What you were suggested to do, though, is to take a pre-trained tagger, which would likely belong to another language, and add the n-grams extracted from the tagged Malagasy corpus that you have built. If you have a sufficiently large corpus of tagged documents in your own language, however, it might be more useful for yourself and for the NLP community to develop a tagger specific to Malagasy. After a quick search I could not find any on the internet, so it would be useful to prepare one.
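If you decide to train on your own corpus, a common pattern from the NLTK book is to chain N-gram taggers with backoff and keep part of the corpus aside to measure accuracy. A minimal sketch, assuming the same TaggedCorpusReader and path as in your question (the 90/10 split and the 'NN' default tag are just example choices):

from nltk.corpus.reader import TaggedCorpusReader
from nltk.tag import DefaultTagger, UnigramTagger, BigramTagger, TrigramTagger

reader = TaggedCorpusReader('D:/Mes documents/MY_POS_tagger/nltk_data/corpora/cookbook', r'.*\.pos')
tagged_sents = reader.tagged_sents()

# hold out the last 10% of sentences to evaluate on data the tagger has not seen
split = int(len(tagged_sents) * 0.9)
train_sents = tagged_sents[:split]
test_sents = tagged_sents[split:]

# back off from trigrams to bigrams to unigrams to a default tag,
# so rarer contexts fall back to more general statistics
t0 = DefaultTagger('NN')
t1 = UnigramTagger(train_sents, backoff=t0)
t2 = BigramTagger(train_sents, backoff=t1)
t3 = TrigramTagger(train_sents, backoff=t2)

# evaluate() reports accuracy on the held-out sentences
# (in recent NLTK versions this method is called accuracy())
print(t3.evaluate(test_sents))

The backoff chain means the trigram tagger only decides when it has seen the context before; otherwise it delegates to the bigram tagger, and so on down to the default tag, which is what the NLTK book means by combining N-gram taggers.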