python sentiment-analysis stemming lemmatization

Stemming texts separates words into letters


I am trying to preprocess my text with tokenization, stemming, normalization, and stop-word/punctuation removal. When I apply the Snowball stemmer, my text is split into individual letters separated by commas.

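import string

from nltk.corpus import stopwords
from nltk.stem import SnowballStemmer
from nltk.tokenize import word_tokenize

def processed_tweets(text):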

  punctuate_text= str.maketrans('', '', string.punctuation+string.digits)
  text = text.translate(punctuate_text)

  tokens = word_tokenize(text)

  stop_words = set(stopwords.words('english'))
  filtered_words = [w for w in tokens if w not in stop_words]

  #applying stemming 
  snow_stemmer = SnowballStemmer(language='english')
  text = [snow_stemmer.stem(word) for word in text]

  return text


tweet_df['processed_tweets'] = tweet_df['Tweet Body'].apply(processed_tweets)
tweet_df.head()

This is the output I am getting:

[image: Output]

The following is the output of print(tokens):

[image not available]

This does not happen when I use lemmatization. Is the problem in how I wrote my code, or in the technique I am using (stemming vs. lemmatization)?


Solution

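  • This was a small misunderstanding on my part about which variable to stem. After punctuation removal, text is still a string, so iterating over it yields individual characters and the stemmer is applied to each character. Applying the stemmer to the tokenized, filtered words instead of the 'text' string works: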

    text = [snow_stemmer.stem(word) for word in filtered_words]
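To see why the original loop produced single letters, here is a minimal sketch that runs without NLTK. It rests on several assumptions that are not in the original code: toy_stem is a hypothetical stand-in for SnowballStemmer.stem, STOP_WORDS is a tiny illustrative set rather than NLTK's English list, str.split() replaces word_tokenize, and the stem_tokens flag exists only to switch between the two behaviors.

```python
import string


def toy_stem(word):
    """Hypothetical stand-in for SnowballStemmer.stem: strips a few suffixes."""
    word = word.lower()
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word


# Tiny illustrative stop-word set (assumption, not NLTK's list).
STOP_WORDS = {"the", "is", "a", "and", "are"}


def process(text, stem_tokens=True):
    # Remove punctuation and digits, as in the question.
    table = str.maketrans("", "", string.punctuation + string.digits)
    text = text.translate(table)

    # Simplified tokenization (assumption: whitespace split instead of word_tokenize).
    tokens = text.split()
    filtered_words = [w for w in tokens if w.lower() not in STOP_WORDS]

    # Iterating over a str yields characters; iterating over a list yields words.
    source = filtered_words if stem_tokens else text
    return [toy_stem(item) for item in source]


sample = "The cats are walking, and jumping!"
buggy = process(sample, stem_tokens=False)  # stems the string, character by character
fixed = process(sample, stem_tokens=True)   # stems the filtered tokens

print(buggy[:5])  # ['t', 'h', 'e', ' ', 'c']
print(fixed)      # ['cat', 'walk', 'jump']
```

The buggy variant also ignores the stop-word filtering entirely, because filtered_words is computed but never used. The same applies to the original function, which is why the one-line change above fixes both symptoms.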