Tags: python, performance, nltk, spacy

Tokenizing text is very slow


Question

I have a DataFrame with over 90,000 rows and a column ['texto'] that contains the text of news articles.

The texts average about 3,000 words each, and passing them through word_tokenize is very slow. What would be a more efficient way to do it?

from nltk.tokenize import word_tokenize

# testing on the first 10 rows only; running over all 90,000 rows takes far too long
df['tokenized_text'] = df.iloc[0:10]['texto'].apply(word_tokenize)
df.head()

Also, word_tokenize keeps some punctuation and other characters that I don't want, so I created a function to filter them out, using spaCy.

from spacy.lang.es.stop_words import STOP_WORDS
from nltk.corpus import stopwords

spanish_stopwords = set(stopwords.words('spanish'))
otherCharacters = ['`', '�', ' ', '\xa0']

def tokenize(phrase):
    sentence_tokens = []
    tokenized_phrase = nlp(phrase)
    for token in tokenized_phrase:
        # keep only the tokens that pass every filter
        if (not token.is_punct and not token.is_stop
                and token.text.lower() not in spanish_stopwords
                and token.text.lower() not in otherCharacters
                and token.text.lower() not in STOP_WORDS):
            sentence_tokens.append(token.text.lower())
    return sentence_tokens
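
For example, on a short sentence it returns something like this (just an illustrative input; the exact output depends on the model and the stop-word lists):

print(tokenize('El gobierno anunció nuevas medidas económicas.'))
# e.g. ['gobierno', 'anunció', 'nuevas', 'medidas', 'económicas']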

Is there any other, better method to do this?

Thanks for reading my maybe-noob 👨🏽‍💻 question 😀, have a nice day 🌻.

Clarifications

  1. nlp is defined earlier:

     import spacy
     import es_core_news_sm
     nlp = es_core_news_sm.load()

  2. I'm using spaCy to tokenize, but I also use the NLTK stop words for the Spanish language.

Solution

  • To make spaCy faster when you only want to tokenize,
    you can change:

    nlp = es_core_news_sm.load()

    to:

    nlp = spacy.load("es_core_news_sm", disable=["tagger", "ner", "parser"])

    A short explanation:
    spaCy loads a full language pipeline that doesn't merely tokenize your sentences but also runs parsing, POS tagging and NER. Most of the computation time actually goes into those other tasks (parse tree, POS, NER) rather than into tokenization, which is a much lighter task computationally.
    But, as you can see, spaCy lets you run only what you actually need, and that saves you a lot of time.
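
    As a quick sanity check (a minimal sketch, assuming the model is installed), you can confirm that the expensive components are gone while tokenization still works:

    import spacy

    nlp = spacy.load("es_core_news_sm", disable=["tagger", "ner", "parser"])
    print(nlp.pipe_names)  # the disabled components no longer show up here

    doc = nlp("El gobierno anunció nuevas medidas.")
    print([token.text for token in doc])  # plain tokenization still works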

  • Another thing: you can make your function more efficient by lowercasing each token only once and by adding the extra stop words to spaCy itself (even if you didn't want to do that, the fact that otherCharacters is a list and not a set is also inefficient, since every membership test scans the whole list). A combined sketch of both changes follows below.

    I would also add this:

    # register every extra stop word directly in spaCy's vocabulary,
    # so token.is_stop covers them and the per-token set lookups go away
    for w in stopwords.words('spanish'):
        nlp.vocab[w].is_stop = True
    for w in otherCharacters:
        nlp.vocab[w].is_stop = True
    for w in STOP_WORDS:
        nlp.vocab[w].is_stop = True
    

    and then:

    for token in tokenized_phrase:
        if not token.is_punct and not token.is_stop:
            sentence_tokens.append(token.text.lower())
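
  • Putting it together: here is a sketch of a streamlined version (assuming, as in the question, that the raw texts live in df['texto']; nlp.pipe is spaCy's batched processing, which is usually much faster than calling nlp() once per row):

    import spacy
    from nltk.corpus import stopwords
    from spacy.lang.es.stop_words import STOP_WORDS

    # load only the tokenizer; the tagger/parser/ner do most of the per-document work
    nlp = spacy.load("es_core_news_sm", disable=["tagger", "ner", "parser"])

    # register every extra stop word once, up front
    extra_stops = set(stopwords.words('spanish')) | set(STOP_WORDS) | {'`', '\xa0'}
    for w in extra_stops:
        nlp.vocab[w].is_stop = True

    def to_tokens(doc):
        # lowercase each kept token exactly once
        return [t.text.lower() for t in doc if not t.is_punct and not t.is_stop]

    # nlp.pipe streams the texts through the pipeline in batches
    df['tokenized_text'] = [to_tokens(doc) for doc in nlp.pipe(df['texto'])]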