Tags: python, nltk, tokenize

How do I get the result of every element in the following function


I have a function which returns the part of speech of every word as a list of tuples. When I execute it, I only get the result for the first element (the first tuple). I want to get the result for every element (tuple) in that list. For example:

get_word_pos("I am watching")

The result I get is:

[('I', 'PRP'), ('am', 'VBP'), ('watching', 'VBG')]
'n'

But the result I want is as follows:

"n"
"v"
"v"

The function that I have written contains multiple return statements, which is why I am only getting the first element as output. Could someone please modify my function so that I get the desired output? The code is as follows:

import nltk
from nltk.corpus import state_union, wordnet

training = state_union.raw("2005-GWBush.txt")
tokenizer = nltk.tokenize.punkt.PunktSentenceTokenizer(training)

def get_word_pos(word):

    sample = word

    tokenized = tokenizer.tokenize(sample)

    for i in tokenized:
        words = nltk.word_tokenize(i)
        tagged = nltk.pos_tag(words)
        print(tagged)

    for letter in tagged:
        # print(letter[1])
        if letter[1].startswith('J'):
            return wordnet.ADJ
        elif letter[1].startswith('V'):
            return wordnet.VERB
        elif letter[1].startswith('N'):
            return wordnet.NOUN
        elif letter[1].startswith('R'):
            return wordnet.ADV
        else:
            return wordnet.NOUN

Solution

  • As you iterate over tagged, you return a value on the first item, so the loop never reaches the rest. You need to accumulate the results instead; appending them to a list is one way of doing it. For example:

    from nltk import word_tokenize, pos_tag
    from nltk.corpus import state_union
    from nltk.tokenize import PunktSentenceTokenizer
    from nltk.corpus import wordnet
    
    training = state_union.raw('2005-GWBush.txt')
    tokenizer = PunktSentenceTokenizer(training)
    
    def get_word_pos(word):
        result = []
        for token in tokenizer.tokenize(word):
            words = word_tokenize(token)
            for t in pos_tag(words):
                match t[1][0]:
                    case 'J':
                        result.append(wordnet.ADJ)
                    case 'V':
                        result.append(wordnet.VERB)
                    case 'R':
                        result.append(wordnet.ADV)
                    case _:
                        result.append(wordnet.NOUN)
        return result
    
    
    print(get_word_pos('I am watching'))
    

    Output:

    ['n', 'v', 'v']
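
  • Note that the match statement used above requires Python 3.10 or later. On older interpreters the same mapping can be expressed with a plain dictionary lookup instead; here is a minimal sketch, assuming the same imports and tokenizer as in the example above:

    # Penn Treebank tag prefix -> WordNet POS constant; anything else falls back to NOUN
    TAG_MAP = {'J': wordnet.ADJ, 'V': wordnet.VERB, 'R': wordnet.ADV}

    def get_word_pos(word):
        result = []
        for token in tokenizer.tokenize(word):
            for _, tag in pos_tag(word_tokenize(token)):
                result.append(TAG_MAP.get(tag[0], wordnet.NOUN))
        return result

    print(get_word_pos('I am watching'))  # ['n', 'v', 'v']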