When using spaCy to tokenize a sentence, I want it not to split tokens on /.
Example:
import en_core_web_lg
nlp = en_core_web_lg.load()
for i in nlp("Get 10ct/liter off when using our App"):
    print(i)
Output:
Get
10ct
/
liter
off
when
using
our
App
I want the output to be like: Get, 10ct/liter, off, when, ...
I was able to find out how to add more splitting rules to spaCy's tokenizer, but not how to disable specific ones.
I suggest using a custom tokenizer; see Modifying existing rule sets in the spaCy docs:
import spacy
from spacy.lang.char_classes import ALPHA, ALPHA_LOWER, ALPHA_UPPER, HYPHENS
from spacy.lang.char_classes import CONCAT_QUOTES, LIST_ELLIPSES, LIST_ICONS
from spacy.util import compile_infix_regex
nlp = spacy.load("en_core_web_trf")
text = "Get 10ct/liter off when using our App"
# Modify tokenizer infix patterns
infixes = (
    LIST_ELLIPSES
    + LIST_ICONS
    + [
        r"(?<=[0-9])[+\-\*^](?=[0-9-])",
        r"(?<=[{al}{q}])\.(?=[{au}{q}])".format(
            al=ALPHA_LOWER, au=ALPHA_UPPER, q=CONCAT_QUOTES
        ),
        r"(?<=[{a}]),(?=[{a}])".format(a=ALPHA),
        r"(?<=[{a}])(?:{h})(?=[{a}])".format(a=ALPHA, h=HYPHENS),
        #r"(?<=[{a}0-9])[:<>=/](?=[{a}])".format(a=ALPHA),
        r"(?<=[{a}0-9])[:<>=](?=[{a}])".format(a=ALPHA),
    ]
)
infix_re = compile_infix_regex(infixes)
nlp.tokenizer.infix_finditer = infix_re.finditer
doc = nlp(text)
print([t.text for t in doc])
## => ['Get', '10ct/liter', 'off', 'when', 'using', 'our', 'App']
Note the commented-out
#r"(?<=[{a}0-9])[:<>=/](?=[{a}])".format(a=ALPHA),
line: I simply removed the /
char from the [:<>=/]
character class. That rule split on a / occurring between a letter or digit and a letter.
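To see the difference that single character makes, here is a minimal sanity check that runs the same pattern strings through Python's re module directly, outside of spaCy:
import re
from spacy.lang.char_classes import ALPHA

text = "Get 10ct/liter off when using our App"

# Original rule: split on :, <, >, = or / between a letter/digit and a letter
with_slash = re.compile(r"(?<=[{a}0-9])[:<>=/](?=[{a}])".format(a=ALPHA))
# Modified rule: same, but with / removed from the character class
without_slash = re.compile(r"(?<=[{a}0-9])[:<>=](?=[{a}])".format(a=ALPHA))

print(with_slash.search(text))     # matches the / in 10ct/liter -> would split
print(without_slash.search(text))  # None -> 10ct/liter stays intact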
If you still need '12/ct'
to be split into three tokens, you will need to add another line below the r"(?<=[{a}0-9])[:<>=](?=[{a}])".format(a=ALPHA)
line:
r"(?<=[0-9])/(?=[{a}])".format(a=ALPHA),