
How does spaCy split "'s"?


spaCy tokenizes "name's" as two tokens: name and 's. How can I combine those two tokens? And which rule defines the splitting of "'s" — the infix rules, or something else?


Solution

  • For spaCy v2.2.3+, you can use nlp.tokenizer.explain() to see which tokenizer settings lead to particular tokens:

    import spacy
    nlp = spacy.blank("en")
    
    nlp.tokenizer.explain("name's")
    # [('TOKEN', 'name'), ('SUFFIX', "'s")]
    

    For English, variants of 's are matched by the suffix_search setting (not the infixes). You can modify the suffix regex to change how the tokenizer handles this: https://spacy.io/usage/linguistic-features#native-tokenizer-additions
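    As a sketch of that approach, you can rebuild the suffix regex without the apostrophe-s patterns and assign it back to the tokenizer. (The exact set of apostrophe variants in the defaults may differ between spaCy versions; the ones below are an assumption based on the English defaults.)

    ```python
    import spacy
    from spacy.util import compile_suffix_regex

    nlp = spacy.blank("en")

    # Drop the possessive 's patterns from the default suffixes
    # (both the plain apostrophe and the unicode right-single-quote variants)
    suffixes = [s for s in nlp.Defaults.suffixes if s not in ("'s", "'S", "’s", "’S")]
    nlp.tokenizer.suffix_search = compile_suffix_regex(suffixes).search

    doc = nlp("name's")
    print([t.text for t in doc])
    ```

    With the 's suffixes removed, "name's" should come out as a single token.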
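    Alternatively, if you want to keep the default tokenization but join the two pieces back together afterwards, you can merge them with Doc.retokenize (a sketch; "name's" is just the example string from the question):

    ```python
    import spacy

    nlp = spacy.blank("en")
    doc = nlp("name's")
    print([t.text for t in doc])  # ['name', "'s"]

    # Merge the two tokens back into a single token
    with doc.retokenize() as retokenizer:
        retokenizer.merge(doc[0:2])

    print([t.text for t in doc])  # ["name's"]
    ```

    This changes only the Doc object in place, so the tokenizer rules stay untouched for other texts.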