I have built a plaintext corpus, and the next step is to lemmatize all my texts. I'm using the WordNetLemmatizer and need the POS tag for each token, to avoid the problem that e.g. loving -> lemma = loving and love -> lemma = love...
I think the default POS tag of the WordNetLemmatizer is n (= noun), but how can I use the output of pos_tag? I think the POS tags the WordNetLemmatizer expects are different from the ones I get from pos_tag. Is there a function or something else that can help me?
I think word_pos is wrong in this line, and that this is what causes the error:
lemma = wordnet_lemmatizer.lemmatize(word,word_pos)
import nltk
from nltk.corpus import PlaintextCorpusReader
from nltk import sent_tokenize, word_tokenize, pos_tag
from nltk.stem import WordNetLemmatizer

wordnet_lemmatizer = WordNetLemmatizer()
corpus_root = 'C:\\Users\\myname\\Desktop\\TestCorpus'
lyrics = PlaintextCorpusReader(corpus_root, '.*')

for fileid in lyrics.fileids():
    tokens = word_tokenize(lyrics.raw(fileid))
    tagged_tokens = pos_tag(tokens)
    for tagged_token in tagged_tokens:
        word = tagged_token[0]
        word_pos = tagged_token[1]
        lemma = wordnet_lemmatizer.lemmatize(word, pos=word_pos)
Additional question: Is pos_tag good enough for my lemmatization, or do I need another tagger? My texts are lyrics...
You need to convert the tag returned by pos_tag into one of the four "syntactic categories" that WordNet recognizes, then pass that to the lemmatizer as word_pos.
From the docs:
Syntactic category: n for noun files, v for verb files, a for adjective files, r for adverb files.
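One way to do that conversion could look like the sketch below. It assumes pos_tag returns Penn Treebank tags (JJ, VBG, NNS, RB, ...), which is what NLTK's default English tagger produces. The helper names penn_to_wordnet, lemmatize_tagged and lemmatize_text are made up for illustration and are not part of NLTK. Anything that is not an adjective, verb, noun or adverb falls back to n, the lemmatizer's own default.

```python
# Hypothetical helpers: the function names are illustrative, not NLTK API.

def penn_to_wordnet(tag):
    """Map a Penn Treebank tag (assumed output of nltk.pos_tag) to a WordNet category."""
    if tag.startswith('J'):
        return 'a'  # adjectives: JJ, JJR, JJS
    if tag.startswith('V'):
        return 'v'  # verbs: VB, VBD, VBG, VBN, VBP, VBZ
    if tag.startswith('N'):
        return 'n'  # nouns: NN, NNS, NNP, NNPS
    if tag.startswith('RB'):
        return 'r'  # adverbs: RB, RBR, RBS (RP, the particle tag, is excluded)
    return 'n'      # everything else: same as the lemmatizer's default


def lemmatize_tagged(tagged_tokens, lemmatizer):
    """Lemmatize (word, penn_tag) pairs with any object offering lemmatize(word, pos=...)."""
    return [lemmatizer.lemmatize(word, pos=penn_to_wordnet(tag))
            for word, tag in tagged_tokens]


def lemmatize_text(text):
    """Tokenize, tag and lemmatize raw text; assumes the needed NLTK data is downloaded."""
    from nltk import pos_tag, word_tokenize
    from nltk.stem import WordNetLemmatizer
    return lemmatize_tagged(pos_tag(word_tokenize(text)), WordNetLemmatizer())
```

In the loop from the question, the last line would then become lemma = wordnet_lemmatizer.lemmatize(word, pos=penn_to_wordnet(word_pos)).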