Tags: python, nlp, spacy

Identify documents after processing in spaCy pipeline?


I have a lot of documents which are processed through a spaCy pipeline.

As I would like to understand and trace issues back to individual documents (encoding problems, text fragments, wrong tagging, etc.), I would like to be able to identify the source document after it has been processed.
Right now, the pipeline only accepts a list of texts, so I cannot pass any additional ID into it.

Is there any way to specify a document ID that is preserved through the spaCy pipeline, so that each document can be identified afterwards?


Solution

  • You can set a custom extension on each `Doc` and pass pre-made `Doc` objects rather than raw texts to `nlp.pipe`; the pipeline components then run on your docs, and the extension attribute survives processing:

    import spacy
    from spacy.tokens import Doc
    
    Doc.set_extension("id", default=-1)
    
    
    def get_docs_from_remote(nlp, size):
        """Yield pre-tokenized docs, each tagged with a custom ID."""
        for i in range(size):
            doc = nlp.make_doc(str(i))  # tokenization only, no pipeline components
            doc._.id = i
            yield doc
    
    
    # disable components you don't need to speed up processing
    nlp = spacy.load("en_core_web_sm", disable=["ner", "lemmatizer"])
    docs = nlp.pipe(
        get_docs_from_remote(
            nlp,
            size=10,
        ),
    )
    
    for doc in docs:
        print(doc._.id, doc.text)
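
  • Alternatively, `nlp.pipe` accepts `as_tuples=True`, which lets you pass `(text, context)` pairs and get `(doc, context)` pairs back, so an ID can ride along without defining an extension. A minimal sketch using a blank English pipeline (no model download needed; the `"id"` key in the context dict is just an example):

    ```python
    import spacy

    # blank pipeline is enough to demonstrate the mechanism
    nlp = spacy.blank("en")

    # each item is a (text, context) tuple; context can be any object
    texts = [
        ("First document.", {"id": 0}),
        ("Second document.", {"id": 1}),
    ]

    results = []
    for doc, context in nlp.pipe(texts, as_tuples=True):
        # context comes back unchanged alongside the processed doc
        results.append((context["id"], doc.text))
    ```

    This avoids building `Doc` objects up front, at the cost of keeping the ID outside the `Doc` itself; if downstream code only ever sees the `Doc`, the extension approach above is the better fit.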