Tags: python, nlp, spacy

Identify documents after processing in spaCy pipeline?


I have a lot of documents which are processed through a spaCy pipeline.

As I would like to understand and trace issues back to individual documents (encoding problems, text fragments, wrong tagging, etc.), I would like to be able to identify the source document after it has been processed.
Right now, the pipeline only accepts a list of texts, so I cannot pass any additional ID into it.

Is there any way to specify a document ID that is preserved through the spaCy pipeline, so that each document can be identified afterwards?


Solution

  • You can set a custom extension on each `Doc` and pass pre-made `Doc` objects rather than raw texts to `nlp.pipe`; the pipeline components then run on your docs, and the extension attribute survives processing:

    import spacy
    from spacy.tokens import Doc
    
    Doc.set_extension("id", default=-1)
    
    
    def get_docs_from_remote(nlp, size):
        """Yield pre-tokenized docs, each tagged with a custom ID."""
        for i in range(size):
            doc = nlp.make_doc(str(i))  # tokenization only, no pipeline components
            doc._.id = i
            yield doc
    
    
    # disable components you don't need to speed up processing
    nlp = spacy.load("en_core_web_sm", disable=["ner", "lemmatizer"])
    docs = nlp.pipe(
        get_docs_from_remote(
            nlp,
            size=10,
        ),
    )
    
    for doc in docs:
        print(doc._.id, doc.text)
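
  • Alternatively, `nlp.pipe` accepts `as_tuples=True`, which lets you pass `(text, context)` pairs and get `(doc, context)` pairs back, so an ID can ride along without defining an extension. A minimal sketch using a blank English pipeline (no model download needed; the `"id"` key in the context dict is just an example):

    ```python
    import spacy

    # blank pipeline is enough to demonstrate the mechanism
    nlp = spacy.blank("en")

    # each item is a (text, context) tuple; context can be any object
    texts = [
        ("First document.", {"id": 0}),
        ("Second document.", {"id": 1}),
    ]

    results = []
    for doc, context in nlp.pipe(texts, as_tuples=True):
        # context comes back unchanged alongside the processed doc
        results.append((context["id"], doc.text))
    ```

    This avoids building `Doc` objects up front, at the cost of keeping the ID outside the `Doc` itself; if downstream code only ever sees the `Doc`, the extension approach above is the better fit.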