Search code examples
pythonregextokenizespacy

How can I get Spacy to stop splitting both hyphenated numbers and words into separate tokens?


Thanks for looking. I am using spaCy to perform Named Entity Recognition on a block of text, and I am having a peculiar problem I can't seem to overcome. Here is a sample code:

from spacy.tokenizer import Tokenizer
nlp = spacy.load("en_core_web_sm")

doc = nlp('The Indo-European Caucus won the all-male election 58-32.')

This results in the following:

['The', 'Indo', '-', 'European', 'Caucus', 'won', 'the', 'all', '-', 'male', 'election', ',', '58', '-', '32', '.']

My problems is that I need those words and numbers that contain hyphens to come through as single tokens. I followed the examples given at this answer by using the following code:

inf = list(nlp.Defaults.infixes)
inf = [x for x in inf if '-|–|—|--|---|——|~' not in x] # remove the hyphen-between-letters pattern from infix patterns
infix_re = compile_infix_regex(tuple(inf))

def custom_tokenizer(nlp):
    return Tokenizer(nlp.vocab, prefix_search=nlp.tokenizer.prefix_search,
                                suffix_search=nlp.tokenizer.suffix_search,
                                infix_finditer=infix_re.finditer,
                                token_match=nlp.tokenizer.token_match,
                                rules=nlp.Defaults.tokenizer_exceptions)

nlp.tokenizer = custom_tokenizer(nlp)

That helped with the alphabetic characters, and I got this:

['The', 'Indo-European', 'Caucus', 'won', 'the', 'all-male', 'election', ',', '58', '-', '32', '.']

That was much better, but the '58-32' was still split into separate tokens. I tried this answer and got the reverse effect:

['The', 'Indo', '-', 'European', 'Caucus', 'won', 'the', 'all', '-', 'male', 'election', ',' '58-32', '.']

How can I alter the tokenizer to give me the correct results in both circumstances?


Solution

  • You may combine the two solutions:

    import spacy
    from spacy.tokenizer import Tokenizer
    from spacy.util import compile_infix_regex
    
    nlp = spacy.load("en_core_web_sm")
    
    def custom_tokenizer(nlp):
        inf = list(nlp.Defaults.infixes)               # Default infixes
        inf.remove(r"(?<=[0-9])[+\-\*^](?=[0-9-])")    # Remove the generic op between numbers or between a number and a -
        inf = tuple(inf)                               # Convert inf to tuple
        infixes = inf + tuple([r"(?<=[0-9])[+*^](?=[0-9-])", r"(?<=[0-9])-(?=-)"])  # Add the removed rule after subtracting (?<=[0-9])-(?=[0-9]) pattern
        infixes = [x for x in infixes if '-|–|—|--|---|——|~' not in x] # Remove - between letters rule
        infix_re = compile_infix_regex(infixes)
    
        return Tokenizer(nlp.vocab, prefix_search=nlp.tokenizer.prefix_search,
                                    suffix_search=nlp.tokenizer.suffix_search,
                                    infix_finditer=infix_re.finditer,
                                    token_match=nlp.tokenizer.token_match,
                                    rules=nlp.Defaults.tokenizer_exceptions)
    
    nlp.tokenizer = custom_tokenizer(nlp)
    doc = nlp('The Indo-European Caucus won the all-male election 58-32.')
    print([token.text for token in doc]) 
    

    Output:

    ['The', 'Indo-European', 'Caucus', 'won', 'the', 'all-male', 'election', '58-32', '.']