Tags: python, nlp, nltk, stemming

Stemming full strings in Python


I need to perform stemming on Portuguese strings. To do so, I'm tokenizing the string with the nltk.word_tokenize() function and then stemming each word individually. After that, I rebuild the string. It works, but it doesn't perform well. How can I make it faster? The string is about 2 million words long.

    tokenAux=""
    tokens = nltk.word_tokenize(portugueseString)
        for token in tokens:
            tokenAux = token
            tokenAux = stemmer.stem(token)    
            textAux = textAux + " "+ tokenAux
    print(textAux)

Sorry for my bad English, and thanks!


Solution

  • Strings are immutable in Python, so it is not good practice to update a long string piece by piece: each concatenation builds a brand-new string. The link here explains various ways to concatenate strings and compares their performance. And since the iteration is done only once, a generator expression is a better choice than a list comprehension (for details you can look into the discussion here). In this case, using a generator expression with join is helpful:

    Using my_text as the long test string: len(my_text) -> 444399

    Using timeit to compare:

    %%timeit
    # the original loop from the question
    tokenAux = ""
    textAux = ""
    tokens = nltk.word_tokenize(my_text)
    for token in tokens:
        tokenAux = stemmer.stem(token)
        textAux = textAux + " " + tokenAux
    

    Result:

    1 loop, best of 3: 6.23 s per loop
    

    Using a generator expression with join:

    %%timeit 
    ' '.join(stemmer.stem(token) for token in nltk.word_tokenize(my_text))
    

    Result:

    1 loop, best of 3: 2.93 s per loop
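
  • Putting it together for Portuguese, here is a minimal end-to-end sketch. The question does not say which stemmer is used; NLTK's RSLP stemmer is assumed here, and both it and the punkt tokenizer data need a one-time download:

    import nltk
    from nltk.stem import RSLPStemmer

    # nltk.download('punkt')  # tokenizer data, one-time
    # nltk.download('rslp')   # Portuguese stemmer data, one-time

    stemmer = RSLPStemmer()

    def stem_text(text):
        # Tokenize, stem each token, and let join build the result
        # string once instead of reallocating it on every iteration.
        return ' '.join(stemmer.stem(token)
                        for token in nltk.word_tokenize(text, language='portuguese'))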
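
  • If that is still too slow on a text of about 2 million words, one further idea (not benchmarked above) is to cache stems: natural language repeats the same words constantly, so stemming each distinct token only once can save a lot of work. A sketch using functools.lru_cache, reusing stemmer and my_text from above:

    from functools import lru_cache

    @lru_cache(maxsize=None)
    def cached_stem(token):
        # Each distinct token is stemmed once; repeats hit the cache.
        return stemmer.stem(token)

    ' '.join(cached_stem(token) for token in nltk.word_tokenize(my_text))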