Tags: python, azure, named-entity-recognition, part-of-speech, azure-machine-learning-service

Part of speech tagging and entity recognition - python


I want to perform part-of-speech tagging and entity recognition in Python, similar to the Maxent_POS_Tag_Annotator and Maxent_Entity_Annotator functions of openNLP in R. I would prefer Python code that takes a sentence as input and outputs features such as the number of "CC", the number of "CD", the number of "DT", and so on. CC, CD, and DT are POS tags from the Penn Treebank, so there should be 36 columns/features for POS tagging, one for each of the 36 Penn Treebank POS tags. I want to implement this in the Azure ML "Execute Python Script" module, and Azure ML supports Python 2.7.7. I heard that NLTK may do the job, but I am a beginner in Python. Any help would be appreciated.


Solution

  • Take a look at the "Categorizing and Tagging Words" section of the NLTK book.

    Here is a simple example, which uses the Penn Treebank tagset:

    # Requires the NLTK tokenizer and tagger data (installed via nltk.download()).
    from nltk.tag import pos_tag
    from nltk.tokenize import word_tokenize
    pos_tag(word_tokenize("John's big idea isn't all that bad."))
    
    [('John', 'NNP'),
     ("'s", 'POS'),
     ('big', 'JJ'),
     ('idea', 'NN'),
     ('is', 'VBZ'),
     ("n't", 'RB'),
     ('all', 'DT'),
     ('that', 'DT'),
     ('bad', 'JJ'),
     ('.', '.')]
    

    Then you can use

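    from collections import defaultdict
    from nltk.tag import pos_tag
    from nltk.tokenize import word_tokenize

    # Count how many times each POS tag occurs in the sentence.
    counts = defaultdict(int)
    for (word, tag) in pos_tag(word_tokenize("John's big idea isn't all that bad.")):
        counts[tag] += 1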
    

    to get frequencies:

    defaultdict(<type 'int'>, {'JJ': 2, 'NN': 1, 'POS': 1, '.': 1, 'RB': 1, 'VBZ': 1, 'DT': 2, 'NNP': 1})
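
To get the fixed set of 36 columns asked for in the question, one option is to count against an explicit tag list instead of whichever tags happen to appear in the sentence. The following is only a minimal sketch and has not been tried on Azure ML. `PENN_TAGS` and `pos_feature_vector` are names made up for this illustration, and the tagged tokens are hard-coded from the output above so that the snippet runs without NLTK installed.

```python
from collections import Counter

# The 36 word-level Penn Treebank POS tags (punctuation tags are not included).
PENN_TAGS = [
    'CC', 'CD', 'DT', 'EX', 'FW', 'IN', 'JJ', 'JJR', 'JJS', 'LS', 'MD', 'NN',
    'NNS', 'NNP', 'NNPS', 'PDT', 'POS', 'PRP', 'PRP$', 'RB', 'RBR', 'RBS',
    'RP', 'SYM', 'TO', 'UH', 'VB', 'VBD', 'VBG', 'VBN', 'VBP', 'VBZ', 'WDT',
    'WP', 'WP$', 'WRB',
]


def pos_feature_vector(tagged_tokens):
    """Return one count per tag in PENN_TAGS, always in the same order.

    tagged_tokens is a list of (word, tag) pairs, as returned by pos_tag().
    Tags that are not in PENN_TAGS (for example '.') are ignored.
    """
    counts = Counter(tag for _, tag in tagged_tokens)
    return [counts[tag] for tag in PENN_TAGS]


# Hard-coded copy of the pos_tag() output shown earlier (an assumption made
# so the example does not depend on NLTK being installed).
tagged = [('John', 'NNP'), ("'s", 'POS'), ('big', 'JJ'), ('idea', 'NN'),
          ('is', 'VBZ'), ("n't", 'RB'), ('all', 'DT'), ('that', 'DT'),
          ('bad', 'JJ'), ('.', '.')]

features = pos_feature_vector(tagged)
feature_row = dict(zip(PENN_TAGS, features))  # tag name -> count, 36 entries
```

In real use you would pass `pos_tag(word_tokenize(sentence))` in place of the hard-coded list. Tags absent from the sentence get a count of 0, so every sentence yields a row of the same length.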