spacy tokenize apostrophe


I am trying to split words properly to fit my corpus. I'm already using this approach, which fixes hyphenated words. What I can't figure out is how to keep words with apostrophes, such as the contractions can't, won't, don't, and he's, together as one token in spaCy.

More specifically, I am looking for a way to do this for Dutch words such as zo'n, auto's, and massa's, but the problem should be language-independent.

I have the following tokenizer:

import re

import spacy
from spacy.tokenizer import Tokenizer
from spacy.util import compile_prefix_regex, compile_suffix_regex

def custom_tokenizer(nlp):
    prefix_re = compile_prefix_regex(nlp.Defaults.prefixes)
    suffix_re = compile_suffix_regex(nlp.Defaults.suffixes)
    # Splits inside a token on any of these characters, apostrophes included
    infix_re = re.compile(r'''[.\,\?\:\;\...\‘\’\'\`\“\”\"\'~]''')

    return Tokenizer(nlp.vocab, prefix_search=prefix_re.search,
                     suffix_search=suffix_re.search,
                     infix_finditer=infix_re.finditer,
                     token_match=None)

nlp = spacy.load('nl_core_news_sm')
nlp.tokenizer = custom_tokenizer(nlp)

With this, the tokens I get are:

'Mijn','eigen','huis','staat','zo',"'",'n','zes','meter','onder','het','wateroppervlak','van','de','Noordzee','.'

...but the tokens I expected are:

'Mijn','eigen','huis','staat',"zo'n",'zes','meter','onder','het','wateroppervlak','van','de','Noordzee','.'

I know it is possible to add custom rules like:

from spacy.symbols import ORTH, LEMMA

case = [{ORTH: "zo"}, {ORTH: "'n", LEMMA: "een"}]
nlp.tokenizer.add_special_case("zo'n", case)

But I am looking for a more general solution.

I've tried editing the infix_re regex from the other thread, but it doesn't seem to have any impact on the issue. Is there any setting or change I can make to fix this?


Solution

  • There is very recent work in progress in spaCy to fix these types of lexical forms for Dutch. More information is in today's pull request: https://github.com/explosion/spaCy/pull/3409

    More specifically, nl/punctuation.py (https://github.com/explosion/spaCy/pull/3409/files#diff-84f02ed25ff9e44641672ca0ba5c1839) shows how this can be solved by altering the suffixes.
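The linked change works on the suffix rules. For the custom tokenizer in the question, the zo / ' / n split appears to come from the infix pattern, which lists the apostrophe characters explicitly. The following is a minimal sketch, not taken from the pull request, of an infix pattern that splits on an apostrophe only when it is not surrounded by letters on both sides. The names ALWAYS_SPLIT, LETTER, APOSTROPHE and infix_spans are made up for illustration, and the punctuation set is an assumption based on the question's pattern. It uses only Python's re module, so it can be checked without loading a model.

```python
import re

# Assumption: the question's punctuation set, minus the apostrophes ' and ’.
ALWAYS_SPLIT = r'[.,?:;‘`“”"~]'

# [^\W\d_] matches a single Unicode letter in Python's re module.
LETTER = r"[^\W\d_]"

# An apostrophe counts as an infix only if a letter is missing on either side,
# so zo'n and auto's are left alone while a stray quote still matches.
APOSTROPHE = rf"(?<!{LETTER})['’]|['’](?!{LETTER})"

infix_re = re.compile(rf"{ALWAYS_SPLIT}|{APOSTROPHE}")


def infix_spans(text):
    """Return the substrings that the pattern would treat as infixes."""
    return [m.group() for m in infix_re.finditer(text)]


print(infix_spans("zo'n"))       # []
print(infix_spans("auto's"))     # []
print(infix_spans("huis.tuin"))  # ['.']
print(infix_spans("rock'"))      # ["'"]
```

To try it in the question's custom_tokenizer, infix_re.finditer would be passed as infix_finditer in place of the original pattern. This only changes infix handling: the default suffix rules may still split forms such as auto's, which is the part the pull request addresses.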