I have a data frame of tweets and I'm trying to calculate Tf-Idf on the lemmatized 'tweet' column. The lemmatization results look wrong, and I'm also getting an error when I try to calculate the Tf-Idf.
Below is my code:
import nltk
from nltk.stem import WordNetLemmatizer

w_tokenizer = nltk.tokenize.WhitespaceTokenizer()
lemmatizer = WordNetLemmatizer()

def lemmatize_text(tweet):
    return [lemmatizer.lemmatize(w) for w in w_tokenizer.tokenize(tweet)]

df['tweet_lemmatized'] = df['tweet'].apply(lemmatize_text)
This is an example of the data frame with the new column 'tweet_lemmatized':
target tweet_lemmatized
0 [believe, department, year, released, hoping]
1 [huge, expected, tomorrow, night, beginning]
The lemmatization didn't work well: words like 'hoping' and 'beginning' are left unchanged in the column.
My first question: how can I improve the lemmatization?
Now I want to calculate the Tf-Idf for this column and add new columns with the top words to my original data frame 'df'. This is my code for the Tf-Idf:
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer, ENGLISH_STOP_WORDS

tfidf = TfidfVectorizer(ngram_range=(1, 2), max_features=100,
                        stop_words=ENGLISH_STOP_WORDS).fit(df.tweet_lemmatized)
tfidf_tweet = tfidf.transform(df.tweet_lemmatized)
result = pd.DataFrame(tfidf_tweet.toarray(), columns=tfidf.get_feature_names_out())
This is the error I got:
AttributeError: 'list' object has no attribute 'lower'
TfidfVectorizer.fit takes an iterable of strings, not lists: your df.tweet_lemmatized column contains lists of tokens, so the vectorizer fails when it tries to call .lower() on a list. Join each token list back into a single string before fitting.
For better lemmatization: WordNetLemmatizer treats every word as a noun by default, which is why verbs like 'hoping' come through unchanged. You can use nltk.pos_tag to get each word's part of speech and then lemmatize it based on its tag, for example lemmatizer.lemmatize(word, 'v') lemmatizes word as a verb.
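A sketch of POS-aware lemmatization; get_wordnet_pos is a helper name I've made up here, and nltk.pos_tag needs the 'averaged_perceptron_tagger' data downloaded:

```python
import nltk
from nltk.stem import WordNetLemmatizer

w_tokenizer = nltk.tokenize.WhitespaceTokenizer()
lemmatizer = WordNetLemmatizer()

def get_wordnet_pos(treebank_tag):
    # map Penn Treebank tags (from nltk.pos_tag) to WordNet POS characters
    if treebank_tag.startswith('J'):
        return 'a'  # adjective
    if treebank_tag.startswith('V'):
        return 'v'  # verb
    if treebank_tag.startswith('R'):
        return 'r'  # adverb
    return 'n'      # WordNet's default: noun

def lemmatize_text(tweet):
    tagged = nltk.pos_tag(w_tokenizer.tokenize(tweet))
    return [lemmatizer.lemmatize(w, get_wordnet_pos(t)) for w, t in tagged]
```

With this, 'hoping' should be tagged as a verb form and lemmatized to 'hope' instead of being left as-is.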