python · nlp · token · spacy

What is the difference between token and span (a slice from a doc) in spaCy?


I would like to know the difference between a Token and a Span in spaCy.

Also, when do we have to work with a Span? Why can't we simply use Tokens for any NLP task, especially when we use spaCy's Matcher?

Brief background: my problem came up when I wanted to get the index of a span (its exact character index in the document string, not its ordered index in the spaCy Doc) after using the spaCy Matcher, which returns 'match_id', 'start' and 'end' — so I could only get a Span out of this information, not a Token. I then needed to create training data, which requires the exact index of a word in its sentence. If I had access to a Token, I could simply use token.idx, but a Span does not have that! So I had to write extra code to find the index of the word (which is the same as the span) in its sentence.


Solution

  • Token vs Span

    From spaCy's documentation, a Token represents a single word, punctuation symbol, whitespace, etc. from a document, while a Span is a slice from the document. In other words, a Span is an ordered sequence of Tokens.
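
To make the distinction concrete, here is a minimal sketch. It uses `spacy.blank("en")` (a tokenizer-only pipeline, so no model download is needed); indexing a Doc with an integer yields a Token, while slicing it yields a Span:

```python
import spacy

# blank English pipeline: tokenizer only, sufficient for this illustration
nlp = spacy.blank("en")
doc = nlp("Hello world!")

token = doc[0]    # indexing with an int -> a single Token ("Hello")
span = doc[0:2]   # slicing -> a Span ("Hello world")

print(type(token).__name__)          # Token
print(type(span).__name__)           # Span
print([t.text for t in span])        # ['Hello', 'world']
```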

    Why Spans?

    spaCy's Matcher gives Span-level rather than Token-level information, because it allows a sequence of Tokens to be matched. While a Span can be composed of just one Token, that isn't necessarily the case.

    Consider the following example, where we match the Token "hello" on its own, the Token "world" on its own, and the Span composed of the Tokens "hello" and "world":

    >>> import spacy
    >>> nlp = spacy.load("en")
    >>> from spacy.matcher import Matcher
    >>> matcher = Matcher(nlp.vocab)
    >>> matcher.add(1, None, [{"LOWER": "hello"}])
    >>> matcher.add(2, None, [{"LOWER": "world"}])
    >>> matcher.add(3, None, [{"LOWER": "hello"}, {"LOWER": "world"}])
    

    For "Hello world!" all of these patterns match:

    >>> document = nlp("Hello world!")
    >>> [(token.idx, token) for token in document]
    [(0, Hello), (6, world), (11, !)]
    >>> matcher(document)
    [(1, 0, 1), (3, 0, 2), (2, 1, 2)]
    

    However, the 3rd pattern doesn't match for "Hello, world!", since "Hello" & "world" aren't contiguous Tokens (because of the "," Token), so they don't form a Span:

    >>> document = nlp("Hello, world!")
    >>> [(token.idx, token) for token in document]
    [(0, Hello), (5, ,), (7, world), (12, !)]
    >>> matcher(document)
    [(1, 0, 1), (2, 2, 3)]
    
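Since the `(match_id, start, end)` tuples the Matcher returns use Token indices, the Span for each match can be recovered simply by slicing the Doc. A small sketch of this (note it uses the newer spaCy 3.x `Matcher.add` signature — a string key plus a list of patterns — rather than the older signature shown in the answer above):

```python
import spacy
from spacy.matcher import Matcher

nlp = spacy.blank("en")  # tokenizer-only pipeline; enough for LOWER-based rules
matcher = Matcher(nlp.vocab)
# spaCy 3.x API: string key, then a list of patterns
matcher.add("HELLO_WORLD", [[{"LOWER": "hello"}, {"LOWER": "world"}]])

doc = nlp("Hello world!")
matches = matcher(doc)
# each (match_id, start, end) tuple is a Token-index slice into the Doc
spans = [doc[start:end] for _, start, end in matches]
print([span.text for span in spans])   # ['Hello world']
```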

    Accessing Tokens from Spans

    Regardless, you can still get Token-level information from a Span by iterating over it, the same way you would iterate over the Tokens of a Doc.

    >>> document = nlp("Hello, world!")
    >>> span = document[0:3]
    >>> span, type(span)
    (Hello, world, <class 'spacy.tokens.span.Span'>)
    >>> [(token.idx, token, type(token)) for token in span]
    [(0, Hello, <class 'spacy.tokens.token.Token'>), (5, ,, <class 'spacy.tokens.token.Token'>), (7, world, <class 'spacy.tokens.token.Token'>)]
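
As for the character-offset problem in the question: a Span does in fact carry its own character offsets, via `span.start_char` and `span.end_char` (and `span[0].idx` gives the first Token's offset as well), so `token.idx`-style information is available without extra code. A brief sketch:

```python
import spacy

nlp = spacy.blank("en")
doc = nlp("Hello, world!")

span = doc[2:4]  # the Span "world!"
# start_char / end_char are character offsets into doc.text
print(span.start_char, span.end_char)            # 7 13
print(doc.text[span.start_char:span.end_char])   # world!
print(span[0].idx)                               # 7, same as start_char
```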