I need to perform stemming on Portuguese strings. To do so, I'm tokenizing the string with the nltk.word_tokenize() function and then stemming each word individually. After that, I rebuild the string. It works, but it does not perform well. How can I make it faster? The string is about 2 million words long.
tokenAux=""
tokens = nltk.word_tokenize(portugueseString)
for token in tokens:
tokenAux = token
tokenAux = stemmer.stem(token)
textAux = textAux + " "+ tokenAux
print(textAux)
Sorry for my bad English, and thanks!
Python strings are immutable, so it is not good practice to update a string on every iteration when the string is long. The link here explains various ways to concatenate strings and shows a performance analysis. And since the iteration is done only once, it is better to choose a generator expression over a list comprehension (for details you can look into the discussion here). In this case, using a generator expression with join can be helpful:
Using my_text as the long string: len(my_text) -> 444399
Using timeit to compare:
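For reference, a minimal setup sketch for the timings below; the choice of the RSLP Portuguese stemmer, the nltk.download() calls, and the way my_text is loaded (including the file name) are my assumptions, not something stated in the question:

import nltk
from nltk.stem import RSLPStemmer

nltk.download('punkt')   # tokenizer models, if not already installed
nltk.download('rslp')    # Portuguese (RSLP) stemmer data, if not already installed

stemmer = RSLPStemmer()  # assumed stemmer; any NLTK stemmer exposes .stem()

# my_text is assumed to be a long Portuguese string, e.g. read from a file
# (hypothetical path):
with open('corpus.txt', encoding='utf-8') as f:
    my_text = f.read()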
%%timeit
textAux = ""
tokens = nltk.word_tokenize(my_text)
for token in tokens:
    tokenAux = stemmer.stem(token)
    textAux = textAux + " " + tokenAux
Result:
1 loop, best of 3: 6.23 s per loop
Using a generator expression with join:
%%timeit
' '.join(stemmer.stem(token) for token in nltk.word_tokenize(my_text))
Result:
1 loop, best of 3: 2.93 s per loop
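Applied back to the original code, the whole loop collapses to a single join call; this sketch keeps the question's variable names (portugueseString, stemmer) and assumes stemmer has already been created:

# one-pass stemming: tokenize, stem lazily via a generator expression, join once
textAux = ' '.join(stemmer.stem(token) for token in nltk.word_tokenize(portugueseString))
print(textAux)

join builds the result string in a single pass, so the cost of repeatedly copying an ever-growing string disappears; the remaining time is dominated by tokenization and stemming themselves.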