I have a lot of documents which are processed through a spaCy pipeline.
As I would like to understand and trace issues back to individual documents (encoding problems, stray text fragments, wrong tagging, etc.), I need a way to identify the source document after it has been processed.
Right now, the pipeline only accepts a list of texts, so I cannot pass any additional ID into it.
Is there any way to specify a document ID which is preserved through the spaCy pipeline so that it can be identified afterwards?
You can set custom extensions on each doc and pass docs rather than texts to the pipeline:
import spacy
from spacy.tokens import Doc

# Register a custom extension that will hold the document ID.
Doc.set_extension("id", default=-1)

def get_docs_from_remote(nlp, size):
    for i in range(size):
        doc = nlp.make_doc(str(i))  # only tokenization
        doc._.id = i
        yield doc

nlp = spacy.load("en_core_web_sm", disable=["ner", "lemmatizer"])

# nlp.pipe accepts Doc objects as well as texts; the custom
# attribute survives the rest of the pipeline.
docs = nlp.pipe(get_docs_from_remote(nlp, size=10))

for doc in docs:
    print(doc._.id, doc.text)
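Alternatively, if you don't want to register an extension, `nlp.pipe` also accepts `(text, context)` pairs when called with `as_tuples=True` and yields `(doc, context)` pairs, so the context object can carry the ID. A minimal sketch (using a blank pipeline here just to keep the example self-contained; the texts and IDs are made up):

```python
import spacy

# A blank English pipeline; with a trained model the same pattern applies.
nlp = spacy.blank("en")

data = [
    ("First document.", {"id": "doc-1"}),
    ("Second document.", {"id": "doc-2"}),
]

# as_tuples=True: each (text, context) pair comes back as (doc, context).
for doc, context in nlp.pipe(data, as_tuples=True):
    print(context["id"], doc.text)
```

This keeps the ID entirely outside the Doc object, which can be simpler when the ID is only needed for bookkeeping rather than inside pipeline components.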