Tags: python, nlp, nltk, pos-tagger, universal-pos-tag

How to use the universal POS tags with nltk.pos_tag() function?


I have a text and I want to count the number of 'ADJ', 'PRON', 'VERB', 'NOUN', etc. words in it. I know there is the pos_tag() function, but it returns different tags, and I want results such as 'ADJ', 'PRON', 'VERB', 'NOUN'. This is my code:

import nltk
from nltk.corpus import state_union, brown
from nltk.corpus import stopwords
from nltk import ne_chunk

from nltk.tokenize import PunktSentenceTokenizer
from nltk.tokenize import word_tokenize
from nltk.tokenize import RegexpTokenizer
from nltk.stem import WordNetLemmatizer 

from collections import Counter

sample = "this is my sample text that I want to analyze with programming language"

# tokenizing text (make a list with every word)
sample_tokenization = word_tokenize(sample)
print("THIS IS TOKENIZED SAMPLE TEXT, LIST OF WORDS:\n\n", sample_tokenization)
print()

# tagging words (word_tokenize already returns a list, so no split is needed)
taged_words = nltk.pos_tag(sample_tokenization)
print(taged_words)
print()


# showing the count of every type of word for new text
count_of_word_type = Counter(word_type for word,word_type in taged_words)
count_of_word_type_list = count_of_word_type.most_common() # making a list of tuples counts
print(count_of_word_type_list)


for w_type, num in count_of_word_type_list:
    print(w_type, num)
print() 

The code above works, but I want to find a way to get this type of tag instead:

Tag     Meaning               English Examples
ADJ     adjective             new, good, high, special, big, local
ADP     adposition            on, of, at, with, by, into, under
ADV     adverb                really, already, still, early, now
CONJ    conjunction           and, or, but, if, while, although
DET     determiner, article   the, a, some, most, every, no, which
NOUN    noun                  year, home, costs, time, Africa
NUM     numeral               twenty-four, fourth, 1991, 14:24
PRT     particle              at, on, out, over per, that, up, with
PRON    pronoun               he, their, her, its, my, I, us
VERB    verb                  is, say, told, given, playing, would
.       punctuation marks     . , ; !
X       other                 ersatz, esprit, dunno, gr8, univeristy

I saw that there is a chapter here: https://www.nltk.org/book/ch05.html

That says:

from nltk.corpus import brown
brown_news_tagged = brown.tagged_words(categories='news', tagset='universal')

But I do not know how to apply that on my sample sentence. Thanks for your help.


Solution

  • From https://github.com/nltk/nltk/blob/develop/nltk/tag/__init__.py#L135

    >>> from nltk.tag import pos_tag
    >>> from nltk.tokenize import word_tokenize
    
    # Default Penn Treebank tagset.
    >>> pos_tag(word_tokenize("John's big idea isn't all that bad."))
    [('John', 'NNP'), ("'s", 'POS'), ('big', 'JJ'), ('idea', 'NN'), ('is', 'VBZ'),
    ("n't", 'RB'), ('all', 'PDT'), ('that', 'DT'), ('bad', 'JJ'), ('.', '.')]
    
    # Universal POS tags.
    >>> pos_tag(word_tokenize("John's big idea isn't all that bad."), tagset='universal')
    [('John', 'NOUN'), ("'s", 'PRT'), ('big', 'ADJ'), ('idea', 'NOUN'), ('is', 'VERB'),
    ("n't", 'ADV'), ('all', 'DET'), ('that', 'DET'), ('bad', 'ADJ'), ('.', '.')]
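
  • To get the per-tag counts asked for in the question, the same tagset='universal' argument can be combined with the Counter approach already used there. The following is a minimal sketch, not a definitive implementation: tag_universal and count_tags are hypothetical helper names introduced here, and tag_universal assumes that NLTK and its tokenizer, tagger, and universal_tagset data are installed. The demonstration at the bottom counts the tagged output shown above, so it runs without any NLTK data.

```python
from collections import Counter


def tag_universal(text):
    """Tokenize text and tag each token with the universal tagset.

    Assumes NLTK and its tokenizer, tagger, and universal_tagset data are
    installed. The import is local so count_tags stays usable without them.
    """
    from nltk import pos_tag, word_tokenize

    return pos_tag(word_tokenize(text), tagset="universal")


def count_tags(tagged_words):
    """Count how often each tag occurs in a list of (word, tag) pairs."""
    return Counter(tag for _word, tag in tagged_words)


# Tagged output copied from the universal-tagset example above.
example_tagged = [
    ("John", "NOUN"), ("'s", "PRT"), ("big", "ADJ"), ("idea", "NOUN"),
    ("is", "VERB"), ("n't", "ADV"), ("all", "DET"), ("that", "DET"),
    ("bad", "ADJ"), (".", "."),
]

example_counts = count_tags(example_tagged)
for tag, num in example_counts.most_common():
    print(tag, num)
```

    For the sentence in the question, count_tags(tag_universal(sample)) should give the counts directly once the required data has been fetched with nltk.download(); the mapping resource is named universal_tagset, while the tokenizer and tagger resource names differ between NLTK versions.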