Concatenate two spacy docs together?


How do I concatenate two spaCy Docs, that is, merge them into a single Doc?

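import spacy

# 'en' was a shortcut in older spaCy releases; spaCy v3 needs the full model name
nlp = spacy.load('en_core_web_sm')
doc1 = nlp('This is the doc number one.')
doc2 = nlp('And this is the doc number two.')
new_doc = doc1 + doc2  # raises a TypeError: Doc does not support `+`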

Of course, that raises an error, because a Doc object does not support concatenation with `+`. Is there a straightforward way to do this?

I looked at https://github.com/explosion/spaCy/issues/2229. The issue is closed, which suggests a solution has been implemented, but I cannot find a simple example of it in use.


Solution

  • What about this:

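    import spacy
    from spacy.tokens import Doc

    nlp = spacy.blank('en')
    # A blank pipeline sets no sentence boundaries, so add a rule-based
    # sentence splitter (spaCy v3 syntax); without it, `c_doc.sents` raises an error.
    nlp.add_pipe('sentencizer')
    doc1 = nlp('This is the doc number one.')
    doc2 = nlp('And this is the doc number two.')

    # Fine for a few Docs; for many texts, see the note on nlp.pipe below
    docs = [doc1, doc2]

    # `c_doc` is your "merged" doc
    c_doc = Doc.from_docs(docs)
    print("Merged text: ", c_doc.text)

    # Some quick checks: neither should raise an error.
    # One sentence per input Doc:
    assert len(list(c_doc.sents)) == len(docs)
    # Entities are carried over (both lists are empty here: no NER component):
    assert [str(ent) for ent in c_doc.ents] == [str(ent) for doc in docs for ent in doc.ents]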
    

    For a large number of texts, it is better to process them with nlp.pipe, as shown in the spaCy documentation, and then merge the resulting Docs.

    Hope it helps.
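
As a rough sketch of the nlp.pipe suggestion: the snippet below runs several texts through nlp.pipe and merges the results with Doc.from_docs. The sample texts and the sentencizer component are assumptions added for illustration (a blank pipeline does not set sentence boundaries on its own); they are not part of the original answer, and the syntax assumes spaCy v3.

```python
import spacy
from spacy.tokens import Doc

# Blank English pipeline (tokenizer only) plus a rule-based sentence splitter.
# The sentencizer is an assumption: it is only needed if you want `.sents`.
nlp = spacy.blank("en")
nlp.add_pipe("sentencizer")

# Illustrative sample texts (hypothetical data).
texts = [
    "This is the doc number one.",
    "And this is the doc number two.",
    "Here is a third one.",
]

# nlp.pipe processes the texts as a stream, in batches, which is usually
# more efficient than calling nlp(text) once per text.
docs = list(nlp.pipe(texts))

# Merge all processed Docs into a single Doc.
merged = Doc.from_docs(docs)

print("Merged text:", merged.text)
print("Tokens:", len(merged))
print("Sentences:", len(list(merged.sents)))
```

All Docs passed to Doc.from_docs need to share the same vocab, which is the case here because they all come from the same nlp object.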