Tags: python, machine-learning, nlp, string, tokenizer

What does this Python function do, and when and how should it be used?


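# Assumed import: the original snippet does not show where Tokenizer comes from
from tensorflow.keras.preprocessing.text import Tokenizer

def tokenize_corpus(corpus, num_words=-1):
    # Fit a Tokenizer on the corpus
    if num_words > -1:
        # A limit was given: pass it on as the Tokenizer's vocabulary cap
        tokenizer = Tokenizer(num_words=num_words)
    else:
        # num_words == -1 (the default) means "no limit"
        tokenizer = Tokenizer()
    # Build the word index from the texts in the corpus
    tokenizer.fit_on_texts(corpus)
    return tokenizer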

What is the function trying to do? I understand the part after "else", but I can't follow the part before it. Can someone explain it, please?


Solution

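  • Tokenizer is a text tokenization utility class.

    This class lets you vectorize a text corpus by turning each text into either a sequence of integers or a vector.

    Arguments: num_words is the maximum number of words to keep, based on word frequency. Only the most common num_words-1 words will be kept.

    UPD : Here the num_words parameter of tokenize_corpus is used as a flag. -1 means "don't pass num_words to Tokenizer", so all words are kept; any value greater than -1 is passed through as Tokenizer's num_words. This is just a clumsy implementation: Tokenizer's own default for num_words is None, so the function could default to None and pass it straight through without the if/else.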

    More here : https://keras.io/api/preprocessing/text/

    Coursera Tutorial [Recommended] : https://www.coursera.org/lecture/natural-language-processing-tensorflow/working-with-the-tokenizer-VEUJX
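To make the effect of num_words concrete, here is a minimal pure-Python sketch of the idea. It is not Keras's implementation: build_word_index and texts_to_sequences are hypothetical stand-ins that skip Keras's punctuation filtering and other options, and the three-sentence corpus is made up for illustration. It only shows that words are ranked by frequency and that a word whose rank is num_words or higher is dropped when texts are encoded.

```python
from collections import Counter


def build_word_index(corpus):
    """Rank words by frequency; the most frequent word gets index 1."""
    counts = Counter(word for text in corpus for word in text.lower().split())
    # most_common() orders by descending count; ties keep first-seen order
    return {word: rank
            for rank, (word, _) in enumerate(counts.most_common(), start=1)}


def texts_to_sequences(corpus, word_index, num_words=None):
    """Encode each text as a list of word ranks.

    If num_words is set, words ranked num_words or higher are skipped,
    so only the num_words-1 most common words survive.
    """
    sequences = []
    for text in corpus:
        seq = []
        for word in text.lower().split():
            idx = word_index.get(word)
            if idx is None:
                continue  # word never seen while building the index
            if num_words and idx >= num_words:
                continue  # word is too rare for the requested cap
            seq.append(idx)
        sequences.append(seq)
    return sequences


# Made-up corpus, for illustration only
corpus = ["the cat sat", "the cat ran", "the dog barked"]
word_index = build_word_index(corpus)

full = texts_to_sequences(corpus, word_index)                  # no cap
capped = texts_to_sequences(corpus, word_index, num_words=3)   # keep ranks 1 and 2

print(full)    # [[1, 2, 3], [1, 2, 4], [1, 5, 6]]
print(capped)  # [[1, 2], [1, 2], [1]]
```

With num_words=3 only "the" and "cat" (the two most common words) remain, which matches the "most common num_words-1 words" wording above. In the question's function, calling tokenize_corpus(corpus) corresponds to the uncapped case and tokenize_corpus(corpus, num_words=3) to the capped one.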