machine-learning  nlp  artificial-intelligence  terminology  text-segmentation

Difference between tokenization and segmentation


What is the difference between tokenization and segmentation in NLP? I searched for both terms but didn't really find any differences.


Solution

  • Short answer: All tokenization is segmentation, but not all segmentation is tokenization.

    Long answer:
    Segmentation is the more general concept: it means splitting the input text into pieces by any rule. Tokenization is one kind of segmentation, carried out according to well-defined criteria.
    For example, in a hypothetical scenario where every input sentence is a compound sentence made of two sub-sentences, splitting each one into two independent sentences can be called segmentation (but not tokenization).
    Tokenization is a form of segmentation performed on the basis of semantic criteria or a token dictionary, e.g. word or sub-word tokenization, mainly with the intention of assigning token ids to the resulting pieces for downstream processing.
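
    To make the distinction concrete, here is a minimal sketch in plain Python. It is only one illustrative reading of the terms, not a standard implementation: the function names (segment_sentences, tokenize, build_vocab), the regular expressions, and the sample text are all assumptions made up for this example. The first function splits text into sentence-sized pieces (segmentation), while the second splits it into word-level tokens that are then mapped to integer ids (tokenization).

```python
import re


def segment_sentences(text):
    """Segmentation: split text into sentence-like chunks.

    Hypothetical rule: break on whitespace that follows '.', '!' or '?'.
    """
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


def tokenize(text):
    """Tokenization: split text into lowercase word and punctuation tokens."""
    return re.findall(r"\w+|[^\w\s]", text.lower())


def build_vocab(tokens):
    """Assign each distinct token an integer id in order of first appearance."""
    vocab = {}
    for tok in tokens:
        vocab.setdefault(tok, len(vocab))
    return vocab


text = "I like tea. You like coffee."

sentences = segment_sentences(text)          # pieces are whole sentences
tokens = tokenize(text)                      # pieces are words and punctuation
vocab = build_vocab(tokens)                  # toy token dictionary
token_ids = [vocab[t] for t in tokens]       # ids for downstream processing

print(sentences)
print(tokens)
print(token_ids)
```

    Both functions split the same input, so both are segmentation in the broad sense. Only the second produces units that are looked up in a dictionary and turned into ids, which is what the answer above calls tokenization. Real tokenizers (word-level or sub-word) use far more careful rules than these toy regular expressions.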