I need to perform stemming on Portuguese strings. To do so, I'm tokenizing the string with the nltk.word_tokenize() function and then stemming each word individually. After that, I rebuild the string. It works, but it does not perform well. How can I make it faster? The string is about 2 million words long.
tokenAux=""
tokens = nltk.word_tokenize(portugueseString)
for token in tokens:
tokenAux = token
tokenAux = stemmer.stem(token)
textAux = textAux + " "+ tokenAux
print(textAux)
Sorry for my bad English, and thanks!
Python strings are immutable, so it is not good practice to update a string on every iteration when the string is long. The link here explains various ways to concatenate strings and shows a performance analysis. And since the iteration is done only once, it is better to choose a generator expression over a list comprehension (for details you can look into the discussion here). In this case, using a generator expression with join can be helpful:
Using my_text as the long string: len(my_text) -> 444399
Using timeit to compare:
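For reference, a minimal setup sketch for the timings below; the choice of the RSLP Portuguese stemmer, the nltk.download() calls, and the way my_text is loaded (including the file name) are my assumptions, not something stated in the question:

import nltk
from nltk.stem import RSLPStemmer

nltk.download('punkt')   # tokenizer models, if not already installed
nltk.download('rslp')    # Portuguese (RSLP) stemmer data, if not already installed

stemmer = RSLPStemmer()  # assumed stemmer; any NLTK stemmer exposes .stem()

# my_text is assumed to be a long Portuguese string, e.g. read from a file
# (hypothetical path):
with open('corpus.txt', encoding='utf-8') as f:
    my_text = f.read()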
%%timeit
textAux = ""
tokens = nltk.word_tokenize(my_text)
for token in tokens:
    tokenAux = stemmer.stem(token)
    textAux = textAux + " " + tokenAux
Result:
1 loop, best of 3: 6.23 s per loop
Using a generator expression with join:
%%timeit
' '.join(stemmer.stem(token) for token in nltk.word_tokenize(my_text))
Result:
1 loop, best of 3: 2.93 s per loop
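Applied back to the original code, the whole loop collapses to a single join call; this sketch keeps the question's variable names (portugueseString, stemmer) and assumes stemmer has already been created:

# one-pass stemming: tokenize, stem lazily via a generator expression, join once
textAux = ' '.join(stemmer.stem(token) for token in nltk.word_tokenize(portugueseString))
print(textAux)

join builds the result string in a single pass, so the cost of repeatedly copying an ever-growing string disappears; the remaining time is dominated by tokenization and stemming themselves.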