Tags: python, regex, token, tokenize, spacy

Is it possible to change the token split rules for a spaCy tokenizer?


The (German) spaCy tokenizer does not split on slashes, underscores, or asterisks by default, which is exactly what I need (so "der/die" results in a single token).

However, it does split on parentheses, so "dies(und)das" gets split into 5 tokens. Is there a (simple) way to tell the default tokenizer not to split on parentheses that are enclosed by letters on both sides with no space in between?

How exactly are those splits on parentheses defined for a tokenizer?

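For reference, the behaviour described above looks roughly like this (a minimal sketch; which German pipeline you load is up to you):

    import spacy

    # Default German tokenizer: slashes are kept together, parentheses are split.
    nlp = spacy.load('de')  # or a concrete German model such as 'de_core_news_sm'
    print([t.text for t in nlp("der/die")])       # ['der/die']
    print([t.text for t in nlp("dies(und)das")])  # ['dies', '(', 'und', ')', 'das']
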

Solution

  • The split on parentheses is defined in this line of the German punctuation rules, which splits on a parenthesis between two letters:

    https://github.com/explosion/spaCy/blob/23ec07debdd568f09c7c83b10564850f9fa67ad4/spacy/lang/de/punctuation.py#L18

    There's no simple way to remove infix patterns, but you can define a custom tokenizer that does what you want. One way is to copy the infix definition from spacy/lang/de/punctuation.py and modify it:

    import spacy
    from spacy.tokenizer import Tokenizer
    from spacy.lang.char_classes import ALPHA, ALPHA_LOWER, ALPHA_UPPER, LIST_ELLIPSES, LIST_ICONS
    from spacy.lang.de.punctuation import _quotes
    from spacy.util import compile_infix_regex
    
    def custom_tokenizer(nlp):
        # Copy of the default German infix patterns, with "(" and ")" removed
        # from the bracket/quote pattern so a parenthesis between two letters
        # (as in "dies(und)das") no longer triggers a split.
        infixes = (
            LIST_ELLIPSES
            + LIST_ICONS
            + [
                r"(?<=[{al}])\.(?=[{au}])".format(al=ALPHA_LOWER, au=ALPHA_UPPER),
                r"(?<=[{a}])[,!?](?=[{a}])".format(a=ALPHA),
                r"(?<=[{a}])[:<>=](?=[{a}])".format(a=ALPHA),
                r"(?<=[{a}]),(?=[{a}])".format(a=ALPHA),
                # brackets and quotes between letters, without \( and \)
                r"(?<=[{a}])([{q}\]\[])(?=[{a}])".format(a=ALPHA, q=_quotes),
                r"(?<=[{a}])--(?=[{a}])".format(a=ALPHA),
                r"(?<=[0-9])-(?=[0-9])",
            ]
        )
    
        infix_re = compile_infix_regex(infixes)
    
        # Keep the default prefix, suffix, and token_match behaviour as well as
        # the tokenizer exceptions; only the infix patterns are replaced.
        return Tokenizer(nlp.vocab, prefix_search=nlp.tokenizer.prefix_search,
                                    suffix_search=nlp.tokenizer.suffix_search,
                                    infix_finditer=infix_re.finditer,
                                    token_match=nlp.tokenizer.token_match,
                                    rules=nlp.Defaults.tokenizer_exceptions)
    
    
    nlp = spacy.load('de')  # spaCy v2 shortcut; with v3+ load e.g. 'de_core_news_sm'
    nlp.tokenizer = custom_tokenizer(nlp)
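
    A quick check after swapping in the custom tokenizer should show the desired behaviour (the expected output follows from the removed infix pattern; exact results may vary with the spaCy version):

    print([t.text for t in nlp("dies(und)das")])  # ['dies(und)das']
    print([t.text for t in nlp("der/die")])       # ['der/die']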