python nltk corpus pos-tagger

Create and use a tagged corpus with NLTK


I'm trying to create a tagged corpus in Malagasy (my mother tongue). I followed the instructions in the document Python text Processing and natural language processing and the page https://www.nltk.org/book/ch05.html . I have managed to create my own Part-of-Speech tagset, based on the Universal Part-of-Speech Tagset, and a small tagged corpus. This is my code:

    import os

    import nltk
    import nltk.data
    from nltk.corpus.reader import TaggedCorpusReader
    from nltk.tag import UnigramTagger

    # Make sure the custom nltk_data directory exists and is on NLTK's search path.
    path = 'D:/Mes documents/MY_POS_tagger/nltk_data'
    if not os.path.exists(path):
        os.makedirs(path)
    print("OS path done :%s" % os.path.exists(path))

    nltk.data.path.append(path)
    print("NLTK data path done:%s" % (path in nltk.data.path))

    # Read the POS file. There is only one document, malagasy.pos,
    # which holds my tagged corpus.
    reader = TaggedCorpusReader(path + '/corpora/cookbook', r'.*\.pos')

    train_sents = reader.tagged_sents()
    tagger = UnigramTagger(train_sents)

    # dago.txt contains only untagged sentences; I just wanted to test
    # whether the tags I assigned in the POS file would work.
    text = nltk.data.load('corpora/cookbook/dago.txt', format='raw')
    text_tokenized = nltk.word_tokenize(text)
    print(tagger.tag(text_tokenized))

I have this result:

    OS path done :True
    NLTK data path done:True
    [('Matory', u'VB'), ('ny', None), ('alika', u'NN')]

So I can see that it works, but I read in the document above that I have to train my tagger. Can someone suggest how I can do that? I read that I need to pickle a trained tagger and to train and combine Ngram taggers, but I don't understand what pickle means or does. I also don't know whether what I'm doing now is the right way to create and use a tagged corpus with NLTK. Thank you.


Solution

  • I need to pickle a trained tagger and to train and combine Ngram taggers, but I don't understand what pickle means or does

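    As for this part of your question: pickle is a module in the Python standard library that serializes Python objects to a stream of bytes and restores them later. It lets you save an object, such as a trained tagger, to a file on your hard drive and load it back without having to retrain it.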

    Info here: https://docs.python.org/3/library/pickle.html
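To make the idea concrete, here is a minimal sketch of a pickle round trip. To keep it dependency-free, the "tagger" is just a plain dictionary standing in for a trained NLTK tagger (an assumption for illustration), and the file name `toy_tagger.pickle` is hypothetical.

```python
import os
import pickle
import tempfile

# Toy stand-in for a trained tagger: a word -> tag lookup table.
# The entries mirror the tagger output shown in the question.
toy_model = {"Matory": "VB", "alika": "NN"}

# Hypothetical file name inside a temporary directory; any writable path works.
model_path = os.path.join(tempfile.mkdtemp(), "toy_tagger.pickle")

# "Pickling": write the object to disk as bytes.
with open(model_path, "wb") as f:
    pickle.dump(toy_model, f)

# "Unpickling": read the bytes back into an equivalent Python object.
with open(model_path, "rb") as f:
    restored = pickle.load(f)

print(restored == toy_model)
print([(word, restored.get(word)) for word in ["Matory", "ny", "alika"]])
```

The same `pickle.dump` and `pickle.load` calls work when the object is an NLTK tagger instead of a dictionary; the restored object then has the same `tag()` method as the one you trained.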

    What was suggested to you, however, is to take a pre-trained tagger, which would likely have been built for another language, and add the n-grams extracted from the Malagasy tagged corpus that you have built. If you have a sufficiently large corpus of tagged documents in your own language, it might be more useful, both for you and for the NLP community, to develop a tagger specific to Malagasy. After a quick search I could not find one on the internet, so preparing one would be valuable.
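If you go the Malagasy-specific route, one possible starting point is the backoff approach from chapter 5 of the NLTK book: a bigram tagger falls back to a unigram tagger, which falls back to a default tag. The sketch below is only an illustration. The single training sentence is toy data, the `DT` tag for "ny" and the default tag `NN` are assumptions, and in practice you would pass `reader.tagged_sents()` from your own corpus instead.

```python
import pickle

from nltk.tag import BigramTagger, DefaultTagger, UnigramTagger

# Toy training data (assumption): replace with reader.tagged_sents().
# The 'DT' tag for "ny" is illustrative, not taken from the real corpus.
train_sents = [
    [("Matory", "VB"), ("ny", "DT"), ("alika", "NN")],
]

# Build the chain from the most general tagger to the most specific one.
default_tagger = DefaultTagger("NN")  # last resort for unknown words
unigram_tagger = UnigramTagger(train_sents, backoff=default_tagger)
bigram_tagger = BigramTagger(train_sents, backoff=unigram_tagger)

tagged = bigram_tagger.tag(["Matory", "ny", "alika"])
print(tagged)

# Serialize the trained chain and restore it, as you would between sessions.
restored = pickle.loads(pickle.dumps(bigram_tagger))

# "saka" does not appear in the training data, so the default tag is used.
print(restored.tag(["ny", "saka"]))
```

With a backoff chain, words the n-gram taggers have never seen get the default tag rather than `None`, which is what happened to "ny" in the question. How well this works depends on how much tagged text you have, so it is worth holding back some tagged sentences to check the tagger against.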