python nlp inverted-index spacy

Inverted index in Python with spaCy for tokenization and a persistent link to the original documents


I want to build an inverted index in Python, using the great spaCy library (https://spacy.io/) to tokenize the words.

They provide a great example of how to run the preprocessing concurrently and end up with a nice list of documents ready to be indexed.

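import spacy

# Model name is an assumption; the original snippet does not show how nlp was loaded.
nlp = spacy.load('en_core_web_sm')

texts = [u'One document.', u'...', u'Lots of documents']
# .pipe streams input, and produces streaming output
iter_texts = (texts[i % 3] for i in range(100000000))
# n_threads and doc.is_parsed come from the older spaCy API; later releases dropped them
for i, doc in enumerate(nlp.pipe(iter_texts, batch_size=50, n_threads=4)):
    assert doc.is_parsed
    if i == 30:
        break
    print(i)
    print(doc)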

What I do not understand so far is how to keep a link to the original document (its file path or URL) with this method, i.e. how to store it as an additional attribute alongside each processed document.


Solution

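  • The approach below comes from https://github.com/explosion/spaCy/issues/172. Split the stream of (id, text) pairs with itertools.tee, send only the texts through nlp.pipe, and zip the ids back onto the resulting docs. This works because nlp.pipe yields docs in the same order as its input.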

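    import itertools
    import spacy

    # Model name is an assumption; the original answer does not show how nlp was loaded.
    nlp = spacy.load('en_core_web_sm')

    def gen_items():
        # Yields (id, text) pairs; the prints show when each item is actually consumed.
        print("Yield 0")
        yield (0, 'Text 0')
        print("Yield 1")
        yield (1, 'Text 1')
        print("Yield 2")
        yield (2, 'Text 2')

    # Duplicate the stream: one copy supplies the ids, the other the texts.
    gen1, gen2 = itertools.tee(gen_items())
    ids = (id_ for (id_, text) in gen1)
    texts = (text for (id_, text) in gen2)
    # n_threads comes from the older spaCy API; later releases dropped it.
    docs = nlp.pipe(texts, batch_size=50, n_threads=4)
    # Docs come back in input order, so zip re-attaches each id to its doc.
    for id_, doc in zip(ids, docs):
        print(id_, doc.text)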