python, rake, stemming

Python Snowball Stemmer + RAKE: generates 'u's


I am trying to extract keywords from a text file, and I'm stemming the text first. The code below works, but for some reason it puts the letter 'u' in front of each keyword in the list. For example, this is what I get:

[(u'keyword1', 5), (u'keyword2', 4)]

And I'm not sure where the 'u' comes from. Here is the code:

import re
import rake
from nltk.stem.snowball import SnowballStemmer

stemmer = SnowballStemmer("english")
rake_object = rake.Rake("SmartStoplist.txt", 5, 3, 4)
with open("test.txt", "r") as f:
    s = f.read()
# Remove special characters that might cause problems with stemming
s = re.sub(r'[^a-zA-Z0-9_*.-]', ' ', s)
words = s.split()
stemmed = ' '.join(stemmer.stem(word) for word in words)
keywords = rake_object.run(stemmed)  # Perform RAKE on stemmed text
print(keywords)

Solution

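  • The 'u' prefix means the value is a Unicode string, which is the type the stemmer returns. The u'...' literal syntax has existed since Python 2.0 and is used throughout Python 2.x. The prefix appears only because printing a list shows the repr of each element; the strings themselves do not contain a 'u'. See the Python documentation on Unicode for details. It is nothing to worry about.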
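If the prefix is distracting in the output, one option is to format each item instead of printing the whole list. The snippet below is a minimal sketch: the `keywords` list is hard-coded sample data in the same shape as the output shown in the question, not actual RAKE output, and the `lines` name is only for illustration.

```python
# Sample data in the same shape as the output shown in the question
# (hard-coded for illustration, not produced by RAKE here).
keywords = [(u'keyword1', 5), (u'keyword2', 4)]

# Formatting each item prints the text itself rather than its repr,
# so no u prefix appears in the output.
lines = ["%s: %d" % (word, score) for word, score in keywords]
print("\n".join(lines))
# keyword1: 5
# keyword2: 4
```

In Python 3 every str is already Unicode, so printing the list directly shows no prefix either.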