Search code examples
python-2.7tokenizespacydependency-parsing

Does spacy take as input a list of tokens?


I would like to use spacy's POS tagging, NER, and dependency parsing without using word tokenization. Indeed, my input is a list of tokens representing a sentence, and I would like to respect the user's tokenization. Is this possible at all, either with spacy or any other NLP package ?

For now, I am using this spacy-based function to put a sentence (a unicode string) in the Conll format:

import spacy
nlp = spacy.load('en')
def toConll(string_doc, nlp):
   doc = nlp(string_doc)
   block = []
   for i, word in enumerate(doc):
          if word.head == word:
                  head_idx = 0
          else:
                  head_idx = word.head.i - doc[0].i + 1
          head_idx = str(head_idx)
          line = [str(i+1), str(word), word.lemma_, word.tag_,
                      word.ent_type_, head_idx, word.dep_]
          block.append(line)
   return block
conll_format = toConll(u"Donald Trump is the new president of the United States of America")

Output:
[['1', 'Donald', u'donald', u'NNP', u'PERSON', '2', u'compound'],
 ['2', 'Trump', u'trump', u'NNP', u'PERSON', '3', u'nsubj'],
 ['3', 'is', u'be', u'VBZ', u'', '0', u'ROOT'],
 ['4', 'the', u'the', u'DT', u'', '6', u'det'],
 ['5', 'new', u'new', u'JJ', u'', '6', u'amod'],
 ['6', 'president', u'president', u'NN', u'', '3', u'attr'],
 ['7', 'of', u'of', u'IN', u'', '6', u'prep'],
 ['8', 'the', u'the', u'DT', u'GPE', '10', u'det'],
 ['9', 'United', u'united', u'NNP', u'GPE', '10', u'compound'],
 ['10', 'States', u'states', u'NNP', u'GPE', '7', u'pobj'],
 ['11', 'of', u'of', u'IN', u'GPE', '10', u'prep'],
 ['12', 'America', u'america', u'NNP', u'GPE', '11', u'pobj']]

I would like to do the same while having as input a list of tokens...


Solution

  • You can run Spacy's processing pipeline against already tokenised text. You need to understand, though, that the underlying statistical models have been trained on a reference corpus that has been tokenised using some strategy and if your tokenisation strategy is significantly different, you may expect some performance degradation.

    Here's how to go about it using Spacy 2.0.5 and Python 3. If using Python 2, you may need to use unicode literals.

    import spacy; nlp = spacy.load('en_core_web_sm')
    # spaces is a list of boolean values indicating if subsequent tokens
    # are followed by any whitespace
    # so, create a Spacy document with your tokenisation
    doc = spacy.tokens.doc.Doc(
        nlp.vocab, words=['nuts', 'itch'], spaces=[True, False])
    # run the standard pipeline against it
    for name, proc in nlp.pipeline:
        doc = proc(doc)