Tags: python, python-3.x, nlp, tokenize, spacy

spacy tokenization merges the wrong tokens


I would like to use spacy for tokenizing Wikipedia scrapes. Ideally it would work like this:

text = 'procedure that arbitrates competing models or hypotheses.[2][3] Researchers also use experimentation to test existing theories or new hypotheses to support or disprove them.[3][4]'

# run spacy
import spacy

spacy_en = spacy.load("en")
doc = spacy_en(text, disable=['tagger', 'ner'])
tokens = [tok.text.lower() for tok in doc]

# desired output
# tokens = [..., 'models', 'or', 'hypotheses', '.', '[2][3]', 'researchers', ...

# actual output
# tokens = [..., 'models', 'or', 'hypotheses.[2][3', ']', 'researchers', ...]

The problem is that 'hypotheses.[2][3' ends up glued together as a single token.

How can I prevent spacy from attaching this '[2][3]' to the preceding token? As long as it is split off from the word 'hypotheses' and from the period at the end of the sentence, I don't care how it is handled. But individual words and punctuation should stay separate from this kind of syntactic noise.

So for example, any of the following would be a desirable output:

  • 'hypotheses', '.', '[2][', '3]'
  • 'hypotheses', '.', '[2', '][3]'
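
For reference, spaCy 2.2+ also provides nlp.tokenizer.explain, which reports which rule (prefix, suffix, infix, special case) produced each token. A minimal sketch, assuming that version and the same "en" model, to see why the citation stays attached:

import spacy

spacy_en = spacy.load("en")
text = 'procedure that arbitrates competing models or hypotheses.[2][3]'

# each entry is a (rule_name, token_text) pair, e.g. ('SUFFIX', ']')
for rule, token_text in spacy_en.tokenizer.explain(text):
    print(rule, token_text)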

Solution

  • I think you could try playing around with the infix patterns:

    import re
    import spacy
    from spacy.tokenizer import Tokenizer

    # treat a period as an infix, i.e. a split point inside a whitespace-delimited chunk
    infix_re = re.compile(r'''[.]''')

    def custom_tokenizer(nlp):
        # a bare Tokenizer that only knows about the infix pattern above
        # (no prefix/suffix rules, no tokenizer exceptions)
        return Tokenizer(nlp.vocab, infix_finditer=infix_re.finditer)

    nlp = spacy.load('en')
    nlp.tokenizer = custom_tokenizer(nlp)
    doc = nlp(u"hello-world! I am hypothesis.[2][3]")
    print([t.text for t in doc])
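
    On this example sentence, that prints something along the lines of ['hello-world!', 'I', 'am', 'hypothesis', '.', '[2][3]']. Note that building the Tokenizer with only infix_finditer drops spaCy's default prefix/suffix rules and tokenizer exceptions, so nothing except the period acts as a split point (for instance, 'hello-world!' stays one token); whether that is acceptable depends on the rest of your Wikipedia text.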
    

    More on this: https://spacy.io/usage/linguistic-features#native-tokenizers
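
  • Alternatively, if you want to keep spaCy's built-in prefix/suffix rules and tokenizer exceptions, you could extend the default infix patterns instead of replacing the whole tokenizer. This is only a sketch under that assumption; the two extra patterns below are my own guess at split points for the citation brackets, not something taken from the spaCy docs:

    import spacy
    from spacy.util import compile_infix_regex

    nlp = spacy.load('en')

    # extra split points: a period directly followed by '[', and a '['
    # directly preceded by ']' (so consecutive citations like '[2][3]' come apart)
    extra_infixes = [r'\.(?=\[)', r'(?<=\])\[']

    infix_re = compile_infix_regex(list(nlp.Defaults.infixes) + extra_infixes)
    nlp.tokenizer.infix_finditer = infix_re.finditer

    doc = nlp(u"competing models or hypotheses.[2][3] Researchers also use experimentation")
    print([t.text for t in doc])

    Because the default suffix rules still strip the trailing ']', this should come out roughly as 'hypotheses', '.', '[2]', '[', '3', ']', which satisfies the "I don't care how the noise is split" requirement above.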