Error when using lemmatization and Tf-Idf calculation on a Twitter data frame in Python


I have a data frame of tweets and I'm trying to calculate Tf-Idf on the lemmatized 'tweet' column. I have a problem with the results of the lemmatization and I'm getting an error when trying to calculate the Tf-Idf.

Below is my code:

import nltk
from nltk.stem import WordNetLemmatizer

w_tokenizer = nltk.tokenize.WhitespaceTokenizer()
lemmatizer = WordNetLemmatizer()

def lemmatize_text(tweet):
    return [lemmatizer.lemmatize(w) for w in w_tokenizer.tokenize(tweet)]

df['tweet_lemmatized'] = df['tweet'].apply(lemmatize_text)

This is an example of the data frame with the new column 'tweet_lemmatized':

  target       tweet_lemmatized
    0        [believe, department, year, released, hoping]
    1        [huge, expected, tomorrow, night, beginning]

The lemmatization didn't work well: words like 'hoping' and 'beginning' are left unchanged in the column.

My first question: how can I improve the lemmatization?

Now I want to calculate the Tf-Idf for this column and add new columns with the top words to my original data frame 'df'.

This is my code for the Tf-Idf:

import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer, ENGLISH_STOP_WORDS

tfidf= TfidfVectorizer(ngram_range=(1,2), max_features=100,  
       stop_words=ENGLISH_STOP_WORDS).fit(df.tweet_lemmatized)

tfidf_tweet = tfidf.transform(df.tweet_lemmatized)

result=pd.DataFrame(tfidf_tweet.toarray(), columns=tfidf.get_feature_names())

This is the error I got:

AttributeError: 'list' object has no attribute 'lower'

Solution

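  • TfidfVectorizer.fit expects an iterable of strings, not lists, so each entry of df.tweet_lemmatized must be a single string rather than a list of tokens (for example, join the tokens with spaces). The vectorizer calls .lower() on every entry, which is what raises the AttributeError for a list. For better lemmatization, note that WordNetLemmatizer treats every word as a noun by default, which is why 'hoping' and 'beginning' are left unchanged. You can use nltk.pos_tag to get each word's part of speech and then lemmatize based on that tag, for example: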

    lemmatizer.lemmatize(word, 'v')
    

    This lemmatizes the word as a verb, so 'hoping' becomes 'hope'.
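
Putting both suggestions together, here is a minimal sketch rather than a definitive implementation. The helper names penn_to_wordnet and tfidf_frame are made up for illustration, and the tag mapping is a common simplification (anything that is not an adjective, verb or adverb is treated as a noun). It assumes the NLTK 'wordnet' corpus and a POS tagger model have already been fetched with nltk.download; the tagger's resource name varies between NLTK versions. It also uses stop_words="english", which selects scikit-learn's built-in English list.

```python
import nltk
import pandas as pd
from nltk.stem import WordNetLemmatizer
from sklearn.feature_extraction.text import TfidfVectorizer

lemmatizer = WordNetLemmatizer()


def penn_to_wordnet(tag):
    """Map a Penn Treebank tag from nltk.pos_tag to a WordNet POS letter."""
    if tag.startswith("J"):
        return "a"  # adjective
    if tag.startswith("V"):
        return "v"  # verb
    if tag.startswith("R"):
        return "r"  # adverb
    return "n"      # noun, which is also the lemmatizer's default


def lemmatize_text(tweet):
    """Lemmatize each token by its POS tag and return ONE string per tweet."""
    tagged = nltk.pos_tag(tweet.split())
    lemmas = [lemmatizer.lemmatize(word, penn_to_wordnet(tag))
              for word, tag in tagged]
    # Join the lemmas so TfidfVectorizer receives a string, not a list.
    return " ".join(lemmas)


def tfidf_frame(texts, max_features=100):
    """Fit Tf-Idf on an iterable of strings and return the scores as a DataFrame."""
    tfidf = TfidfVectorizer(ngram_range=(1, 2), max_features=max_features,
                            stop_words="english")
    matrix = tfidf.fit_transform(texts)
    # get_feature_names_out() exists in recent scikit-learn releases;
    # older releases only provide get_feature_names().
    if hasattr(tfidf, "get_feature_names_out"):
        names = tfidf.get_feature_names_out()
    else:
        names = tfidf.get_feature_names()
    return pd.DataFrame(matrix.toarray(), columns=names)
```

With the data frame from the question, this would be used as df['tweet_lemmatized'] = df['tweet'].apply(lemmatize_text) followed by result = tfidf_frame(df['tweet_lemmatized']). The result has one row per tweet and one column per term, so it can be attached to df with pd.concat([df, result], axis=1), provided df has a default 0..n-1 index.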