spaCy tokenizes "name's" as two tokens: name and 's. How can I combine those two tokens? And which rule defines the splitting of "'s" — infix, or something else?
For spaCy v2.2.3+, you can use nlp.tokenizer.explain() to see which tokenizer settings lead to particular tokens:
import spacy
nlp = spacy.blank("en")
nlp.tokenizer.explain("name's")
# [('TOKEN', 'name'), ('SUFFIX', "'s")]
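If you just want to recombine the two tokens in an existing Doc rather than change the tokenizer, you can merge them with the retokenizer. A minimal sketch (assuming the possessive always splits into exactly two adjacent tokens, as in this example):

```python
import spacy

nlp = spacy.blank("en")
doc = nlp("name's")

# Merge the span covering both tokens ("name" + "'s") back into one token
with doc.retokenize() as retokenizer:
    retokenizer.merge(doc[0:2])

print([t.text for t in doc])  # ["name's"]
```

This changes only this Doc; the tokenizer itself still splits "'s" on the next text it processes.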
For English, variants of 's are matched by the suffix_search setting. You can modify the suffix regex to change this behavior for the tokenizer: https://spacy.io/usage/linguistic-features#native-tokenizer-additions
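As a sketch of that approach: the default suffix patterns include the possessive variants as literal strings, so one way to keep "'s" attached is to filter them out of nlp.Defaults.suffixes, recompile the suffix regex, and assign it to the tokenizer (this assumes the possessive patterns appear as the literal strings filtered below, which may differ across spaCy versions):

```python
import spacy
from spacy.util import compile_suffix_regex

nlp = spacy.blank("en")

# Drop the possessive-'s patterns (straight and curly apostrophes) from the defaults
suffixes = [s for s in nlp.Defaults.suffixes if s not in ("'s", "'S", "\u2019s", "\u2019S")]

# Recompile and install the new suffix regex on the existing tokenizer
nlp.tokenizer.suffix_search = compile_suffix_regex(suffixes).search

print([t.text for t in nlp("name's")])  # ["name's"]
```

Note that tokenizer exceptions for common contractions ("it's", "he's", etc.) are handled separately, so those may still split even after this change.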