In spaCy, I'd like characters like '€', '$', or '¥' to always be treated as a separate token. However, it seems they are sometimes made part of a bigger token. For example, this is good (two tokens):
>>> len(nlp("100€"))
2
But the following is not what I want (I'd like to obtain two tokens in this case also):
>>> len(nlp("N€"))
1
How could I achieve that with spaCy? By the way, don't get too focused on the currency example; I've had this kind of problem with other kinds of characters that have nothing to do with numbers or currencies. The problem is how to make sure a character is always treated as a full token and not glued to some other string in the sentence.
See here.

spaCy's tokenizer works by iterating over whitespace-separated substrings and looking for things like prefixes or suffixes to split off those parts. You can add custom prefixes and suffixes as explained in the link above.
We can use that as follows:
import spacy

nlp = spacy.load('en_core_web_lg')

doc = nlp("N€")
print([t for t in doc])
# [N€]

# Add '€' to the default suffix rules, recompile the suffix regex
# and plug it back into the tokenizer.
suffixes = nlp.Defaults.suffixes + ("€",)
suffix_regex = spacy.util.compile_suffix_regex(suffixes)
nlp.tokenizer.suffix_search = suffix_regex.search

doc = nlp("N€")
print([t for t in doc])
# [N, €]
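
A suffix rule only fires at the end of a whitespace-separated chunk, so something like "A€B" would still come out as a single token. If the character should be split off wherever it appears, you can extend the prefix and infix rules in the same way. Here is a rough sketch along the same lines; the symbols list is just an example of my own, and since the rules are regular expressions, characters with a special meaning (like '$') need re.escape:

import re
import spacy

nlp = spacy.load('en_core_web_lg')

# Characters that should always stand on their own.
# Prefix/suffix/infix rules are regexes, so escape them.
symbols = [re.escape(s) for s in ("€", "$", "¥")]

prefixes = list(nlp.Defaults.prefixes) + symbols
suffixes = list(nlp.Defaults.suffixes) + symbols
infixes = list(nlp.Defaults.infixes) + symbols

nlp.tokenizer.prefix_search = spacy.util.compile_prefix_regex(prefixes).search
nlp.tokenizer.suffix_search = spacy.util.compile_suffix_regex(suffixes).search
nlp.tokenizer.infix_finditer = spacy.util.compile_infix_regex(infixes).finditer

print([t for t in nlp("N€")])   # [N, €]
print([t for t in nlp("€N")])   # [€, N]
print([t for t in nlp("A€B")])  # [A, €, B]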