How to tokenize a word with a hyphen in spaCy


I want to tokenize "bs-it" into ["bs", "it"] using spaCy, as I am using it with Rasa. The output I currently get is ["bs-it"]. Can somebody help me with that?


Solution

  • You can add custom rules to spaCy's tokenizer. If the tokenizer is keeping the hyphenated word as a single token, as in your output, you can change that by adding a custom tokenization rule. In your case, you want to split on an infix, i.e. a character that occurs inside a token between two words, usually a hyphen or an underscore. Note that the hyphen itself is kept as a separate token.

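    import re
    import spacy
    from spacy.tokenizer import Tokenizer
    
    # Treat every hyphen as an infix, i.e. a split point inside a token.
    infix_re = re.compile(r'[-]')
    
    def custom_tokenizer(nlp):
        # Only the infix rule is passed, so this tokenizer has no prefix,
        # suffix, or exception rules (punctuation stays attached to words).
        return Tokenizer(nlp.vocab, infix_finditer=infix_re.finditer)
    
    nlp = spacy.load("en_core_web_sm")
    nlp.tokenizer = custom_tokenizer(nlp)  # replace the default tokenizer
    doc = nlp("bs-it")
    print([t.text for t in doc])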
    

    Output

    ['bs', '-', 'it']
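
If you need exactly ["bs", "it"], with no hyphen token, one option is to filter the hyphen out after tokenizing. The sketch below is a variation on the approach above, not the only way to do it. It extends spaCy's default infix patterns with `spacy.util.compile_infix_regex` instead of building a new `Tokenizer`, so the default prefix, suffix, and exception rules stay in place. It assumes a blank English pipeline (`spacy.blank("en")`) is enough for your use, and `tokenize_without_hyphens` is a hypothetical helper name chosen for illustration. A bare `-` pattern splits on every hyphen inside a token, which may be more aggressive than you want.

```python
import spacy
from spacy.util import compile_infix_regex

# Assumption: a blank English pipeline is enough here (tokenizer only, no model download).
nlp = spacy.blank("en")

# Keep spaCy's default infix patterns and add a bare hyphen as an extra split point.
infixes = list(nlp.Defaults.infixes) + [r"-"]
nlp.tokenizer.infix_finditer = compile_infix_regex(infixes).finditer


def tokenize_without_hyphens(text):
    """Hypothetical helper: tokenize text and drop tokens that are a single hyphen."""
    return [t.text for t in nlp(text) if t.text != "-"]


print(tokenize_without_hyphens("bs-it"))  # ['bs', 'it']
```

The filtering happens on the list of token texts, so the `Doc` itself still contains the hyphen token. If a downstream component reads the `Doc` directly, it will still see the hyphen.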