
How to lemmatise a dataframe column in Python


How can I lemmatise a dataframe column? The CSV file "train.csv" looks like this:

id  tweet
1   retweet if you agree
2   happy birthday your majesty
3   essential oils are not made of chemicals

I performed the following:

import pandas as pd
from nltk.tokenize import TweetTokenizer
from nltk.corpus import stopwords
from nltk.stem.wordnet import WordNetLemmatizer

train_data = pd.read_csv('train.csv', on_bad_lines='skip')  # error_bad_lines=False was deprecated and removed in newer pandas
print(train_data)

# Removing stop words
stop = stopwords.words('english')
test = pd.DataFrame(train_data['tweet'])
test.columns = ['tweet']

test['tweet_without_stopwords'] = test['tweet'].apply(lambda x: ' '.join([word for word in x.split() if word not in stop]))
print(test['tweet_without_stopwords'])

# TOKENIZATION
tt = TweetTokenizer()
test['tokenised_tweet'] = test['tweet_without_stopwords'].apply(tt.tokenize)
print(test)

output:

0 retweet if you agree ... [retweet, agree]
1 happy birthday your majesty ... [happy, birthday, majesty]
2 essential oils are not made of chemicals ... [essential, oils, made, chemicals]
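
Note: TweetTokenizer is used here rather than a plain split() because it keeps tweet-specific tokens such as #hashtags, @handles and emoticons intact, for example:

tt = TweetTokenizer()
print(tt.tokenize('happy bday @queen :)'))   # ['happy', 'bday', '@queen', ':)']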


I tried the following to lemmatise, but I'm getting this error: TypeError: unhashable type: 'list'


lmtzr = WordNetLemmatizer()
lemmatized = [[lmtzr.lemmatize(word) for word in test['tokenised_tweet']]]
print(lemmatized)


Solution

  • I would do the calculation on the dataframe itself:

    changing this:

    lmtzr = WordNetLemmatizer()
    lemmatized = [[lmtzr.lemmatize(word) for word in test['tokenised_tweet']]]
    print(lemmatized)

    to this:

    lmtzr = WordNetLemmatizer()
    test['lemmatize'] = test['tokenised_tweet'].apply(
                        lambda lst: [lmtzr.lemmatize(word) for word in lst])
    
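    why the original fails: the comprehension iterates over the Series itself, so each word is actually a whole list of tokens; lemmatize then tries to use that list as a dictionary key during the WordNet lookup, and lists aren't hashable, hence the TypeError. With apply, the lambda receives one row's token list at a time, so lemmatize only ever sees individual strings. A minimal sketch of the difference:

    lmtzr = WordNetLemmatizer()
    lmtzr.lemmatize('oils')    # 'oil' -- a single string is fine
    # lmtzr.lemmatize(['retweet', 'agree'])  -> TypeError: unhashable type: 'list'
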

    full code:

    from io import StringIO

    import pandas as pd
    from nltk.tokenize import TweetTokenizer
    from nltk.corpus import stopwords
    from nltk.stem.wordnet import WordNetLemmatizer

    # inline sample standing in for train.csv
    data = StringIO(
    """id;tweet
    1;retweet if you agree
    2;happy birthday your majesty
    3;essential oils are not made of chemicals"""
    )
    test = pd.read_csv(data, sep=";")

    # Removing stop words (requires nltk.download('stopwords') once)
    stop = stopwords.words('english')

    test['tweet_without_stopwords'] = test['tweet'].apply(lambda x: ' '.join([word for word in x.split() if word not in stop]))
    print(test['tweet_without_stopwords'])

    # TOKENIZATION
    tt = TweetTokenizer()
    test['tokenised_tweet'] = test['tweet_without_stopwords'].apply(tt.tokenize)
    print(test)

    # LEMMATIZATION (requires nltk.download('wordnet') once)
    lmtzr = WordNetLemmatizer()
    test['lemmatize'] = test['tokenised_tweet'].apply(
                        lambda lst: [lmtzr.lemmatize(word) for word in lst])
    print(test['lemmatize'])
    

    output

    0                    [retweet, agree]
    1          [happy, birthday, majesty]
    2    [essential, oil, made, chemical]
    Name: lemmatize, dtype: object
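
    Note that lemmatize defaults to pos='n' (noun), which is why oils and chemicals become oil and chemical while the verb made is left unchanged. Passing the part of speech explicitly fixes individual words, e.g.:

    lmtzr.lemmatize('made', pos='v')   # 'make'
    lmtzr.lemmatize('oils', pos='n')   # 'oil'

    Doing this across whole tweets would need the tokens POS-tagged first (e.g. with nltk.pos_tag), which is beyond the scope of this fix.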