
Does keras-tokenizer perform the task of lemmatization and stemming?


Does the Keras tokenizer provide functions such as stemming and lemmatization? If it does, how is it done? I need an intuitive understanding. Also, what does text_to_sequence do in that?


Solution

  • There might be some confusion about what a tokenizer does and, respectively, what tokenization is. Tokenization splits a string into smaller entities such as words or single characters; these entities are therefore referred to as tokens. Wikipedia provides a nice example:

    The quick brown fox jumps over the lazy dog becomes:

    <sentence>
      <word>The</word>
      <word>quick</word>
      ...
      <word>dog</word>
    </sentence>
    

    Lemmatization (grouping together the inflected forms of a word) or stemming (the process of reducing inflected, or sometimes derived, words to their word stem) is something you do during preprocessing. Tokenization can be part of a preprocessing pipeline before or after (or both) lemmatization and stemming.
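    To make the intuition concrete, here is a deliberately simplistic toy stemmer, an illustration of the idea of suffix stripping only, not a real algorithm like Porter's (in practice you would use a library such as NLTK or spaCy for stemming and lemmatization):

```python
# Toy suffix-stripping "stemmer" -- for intuition only, not a real
# stemming algorithm. It strips a few common English suffixes so that
# inflected forms collapse onto a shared stem.
def toy_stem(word):
    for suffix in ("ing", "ed", "s"):
        # Only strip when enough of the word remains to look like a stem.
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

print([toy_stem(w) for w in ["jumps", "jumping", "jumped", "dog"]])
# -> ['jump', 'jump', 'jump', 'dog']
```

    Notice how "jumps", "jumping", and "jumped" all map to the same stem "jump"; that grouping is exactly what stemming is for, and it happens before the data ever reaches Keras.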

    Anyhow, Keras is not a framework for fully fledged text preprocessing. Hence, you feed already cleaned, lemmatized, etc. data into Keras. Regarding your first question: no, Keras does not provide functionality like lemmatization or stemming.

    What Keras means by text preprocessing, as here in the docs, is the functionality to prepare data so it can be fed to a Keras model (like a Sequential model). This is, for example, why the Keras Tokenizer does this:

    This class allows to vectorize a text corpus, by turning each text into either a sequence of integers (each integer being the index of a token in a dictionary) or into a vector where the coefficient for each token could be binary, based on word count, based on tf-idf...

    By vectorizing your input strings and transforming them into numeric data, you can feed them as input to a neural network (in the case of Keras).

    What text_to_sequence means can be extracted from this: [...]sequence of integers (each integer being the index of a token in a dictionary)[...]. This means that your former strings afterwards become sequences (e.g. arrays) of numeric integers instead of actual words. (Note that in the actual Keras API the Tokenizer method is called texts_to_sequences.)
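    A minimal sketch of the concept, as a hypothetical pure-Python re-implementation rather than the real Keras API: build a word-to-index dictionary from a corpus, then map each text to a sequence of those integer indices.

```python
# Build a word -> integer index from a corpus, in order of first
# appearance (indices start at 1; 0 is conventionally reserved, e.g.
# for padding).
def fit_on_texts(texts):
    index = {}
    for text in texts:
        for word in text.lower().split():
            if word not in index:
                index[word] = len(index) + 1
    return index

# Replace each word in each text by its integer index, skipping
# words that are not in the dictionary.
def texts_to_sequences(texts, index):
    return [[index[w] for w in text.lower().split() if w in index]
            for text in texts]

corpus = ["The quick brown fox", "The lazy dog"]
word_index = fit_on_texts(corpus)
print(word_index)
# -> {'the': 1, 'quick': 2, 'brown': 3, 'fox': 4, 'lazy': 5, 'dog': 6}
print(texts_to_sequences(corpus, word_index))
# -> [[1, 2, 3, 4], [1, 5, 6]]
```

    The resulting integer sequences are what you can feed into a Keras model; the strings themselves never reach the network.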

    Regarding this, you should also take a look at what Keras Sequential models are (e.g. here), since they take sequences as input.

    Additionally, text_to_word_sequence() (docs) also performs such tokenization, but it does not vectorize your data into numeric vectors; it returns an array of your tokenized strings.

    Converts a text to a sequence of words (or tokens).
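    Conceptually, that function lowercases the text, strips punctuation, and splits on whitespace. A hypothetical sketch of that behavior (not the real Keras function, which has extra options such as a configurable filter and split character):

```python
import string

# Sketch of what text_to_word_sequence conceptually does:
# lowercase, strip punctuation, split on whitespace.
def to_word_sequence(text):
    table = str.maketrans("", "", string.punctuation)
    return text.lower().translate(table).split()

print(to_word_sequence("The quick brown fox jumps over the lazy dog."))
# -> ['the', 'quick', 'brown', 'fox', 'jumps', 'over', 'the', 'lazy', 'dog']
```

    Unlike the integer sequences above, the result is still a list of word strings, so it is a tokenizer only, not a vectorizer.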