python, nlp, nltk, stanford-nlp, pos-tagger

Unknown symbol in nltk pos tagging for Arabic


I have used NLTK to tokenize some Arabic text.

However, I ended up with some results like:

(u'an arabic character/word', '``') or (u'an arabic character/word', ':')

The `` and : tags are not explained in the documentation, so I would like to find out what they mean.

import nltk
from nltk.tokenize.punkt import PunktWordTokenizer

z = "أنا تسلق شجرة"
tkn = PunktWordTokenizer()
sent = tkn.tokenize(z)
tokens = nltk.pos_tag(sent)

print tokens

Solution

  • The default NLTK POS tagger is trained on English text and is intended for English text processing, see http://www.nltk.org/_modules/nltk/tag.html. The `` and : you are seeing are Penn Treebank punctuation tags: `` marks an opening quotation mark, and : covers colons, semicolons, and dashes. The docs:

    An off-the-shelf tagger is available.  It uses the Penn Treebank tagset:
    
        >>> from nltk.tag import pos_tag  # doctest: +SKIP
        >>> from nltk.tokenize import word_tokenize # doctest: +SKIP
        >>> pos_tag(word_tokenize("John's big idea isn't all that bad.")) # doctest: +SKIP
        [('John', 'NNP'), ("'s", 'POS'), ('big', 'JJ'), ('idea', 'NN'), ('is',
        'VBZ'), ("n't", 'RB'), ('all', 'DT'), ('that', 'DT'), ('bad', 'JJ'),
        ('.', '.')]
    

    And the code for pos_tag:

    from nltk.data import load
    
    
    # Standard treebank POS tagger
    _POS_TAGGER = 'taggers/maxent_treebank_pos_tagger/english.pickle'
    def pos_tag(tokens):
        """
        Use NLTK's currently recommended part of speech tagger to
        tag the given list of tokens.
    
            >>> from nltk.tag import pos_tag # doctest: +SKIP
            >>> from nltk.tokenize import word_tokenize # doctest: +SKIP
            >>> pos_tag(word_tokenize("John's big idea isn't all that bad.")) # doctest: +SKIP
            [('John', 'NNP'), ("'s", 'POS'), ('big', 'JJ'), ('idea', 'NN'), ('is',
            'VBZ'), ("n't", 'RB'), ('all', 'DT'), ('that', 'DT'), ('bad', 'JJ'),
            ('.', '.')]
    
        :param tokens: Sequence of tokens to be tagged
        :type tokens: list(str)
        :return: The tagged tokens
        :rtype: list(tuple(str, str))
        """
        tagger = load(_POS_TAGGER)
        return tagger.tag(tokens)
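
    So the mysterious tags come from the Penn Treebank tagset that this English tagger uses. As an illustrative sketch, here is a hand-written excerpt of the Treebank punctuation tags (this mapping is written out by hand for illustration, not loaded from NLTK):

    ```python
    # Hand-written excerpt of Penn Treebank punctuation tags
    # (illustrative only; not pulled from NLTK itself).
    PTB_PUNCT_TAGS = {
        '``': 'opening quotation mark',
        "''": 'closing quotation mark',
        ':': 'colon, semicolon, or dash',
        ',': 'comma',
        '.': 'sentence-final punctuation',
    }

    def describe_tag(tag):
        """Return a human-readable description of a PTB punctuation tag."""
        return PTB_PUNCT_TAGS.get(tag, 'not a punctuation tag')

    print(describe_tag('``'))  # the tag from the question
    print(describe_tag(':'))
    ```

    In other words, the English tagger is labelling parts of your Arabic input as punctuation, which is a sign it is the wrong tool for this text.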
    

    This works for me to get the Stanford tools working in Python on Ubuntu 14.04.1:

    $ cd ~
    $ wget http://nlp.stanford.edu/software/stanford-postagger-full-2015-01-29.zip
    $ unzip stanford-postagger-full-2015-01-29.zip
    $ wget http://nlp.stanford.edu/software/stanford-segmenter-2015-01-29.zip
    $ unzip stanford-segmenter-2015-01-29.zip
    $ python
    

    and then:

    from nltk.tag.stanford import POSTagger
    path_to_model = '/home/alvas/stanford-postagger-full-2015-01-29/models/arabic.tagger'
    path_to_jar = '/home/alvas/stanford-postagger-full-2015-01-29/stanford-postagger-3.5.1.jar'
    
    artagger = POSTagger(path_to_model, path_to_jar, encoding='utf8')
    artagger._SEPARATOR = '/'
    tagged_sent = artagger.tag(u"أنا تسلق شجرة")
    print(tagged_sent)
    

    [out]:

    $ python3 test.py
    [('أ', 'NN'), ('ن', 'NN'), ('ا', 'NN'), ('ت', 'NN'), ('س', 'RP'), ('ل', 'IN'), ('ق', 'NN'), ('ش', 'NN'), ('ج', 'NN'), ('ر', 'NN'), ('ة', 'PRP')]
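
    The character-by-character tags above are a symptom of passing a raw string to tag(): the tagger iterates over the string one character at a time. Tokenizing first should give word-level tags. A minimal sketch, using a naive whitespace split (a real pipeline would use the Stanford segmenter downloaded above for proper Arabic segmentation):

    ```python
    # -*- coding: utf-8 -*-
    # Naive whitespace tokenization as a sketch; the Stanford segmenter
    # downloaded above is the proper tool for Arabic word segmentation.
    sent = u"أنا تسلق شجرة"
    tokens = sent.split()  # a list of three word tokens, not one string
    print(tokens)
    # tagged_sent = artagger.tag(tokens)  # tag the token list, not the raw string
    ```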
    

    If you run into Java problems when using the Stanford POS tagger, see the DELPH-IN wiki: http://moin.delph-in.net/ZhongPreprocessing