Creating a custom component in spaCy

I am trying to create a spaCy pipeline component that returns Spans of meaningful text. My corpus consists of PDF documents that contain a lot of garbage I am not interested in (tables, headers, etc.).

More specifically I am trying to create a function that:

takes a Doc object as an argument
iterates over the doc's tokens
yields a Span object when certain rules are met

Note I would also be happy with returning a list([span_obj1, span_obj2])

What is the best way to do something like this? I am a bit confused about the difference between a pipeline component and an extension attribute.

So far I have tried:

from spacy.lang.en import English
from spacy.tokens import Doc

nlp = English()

Doc.set_extension('chunks', method=iQ_chunker)

####

raw_text = get_test_doc()

doc = nlp(raw_text)

print(type(doc._.chunks))

>>> <class 'functools.partial'>

iQ_chunker is a function that does what I explained above and returns a list of Span objects.

This is not the result I expected, since the function I pass in as method returns a list.

Solution

You're getting a functools.partial back because you registered iQ_chunker with method= but then accessed doc._.chunks as an attribute. A method extension is handed back as a callable with the doc already bound as its first argument, so you would have to call it yourself: doc._.chunks(). If you want spaCy to call the function for you whenever you access the attribute, register it as a getter instead:

Doc.set_extension('chunks', getter=iQ_chunker)

Please see the Doc documentation for more details.
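To make the difference between method and getter concrete, here is a small pure-Python sketch of the idea. It is a simplified model and not spaCy's actual implementation: the Underscore class, the extension names and the halves function are all made up for illustration, and a plain list of strings stands in for the Doc.

```python
import functools


class Underscore:
    """Toy stand-in for the object behind `doc._` (not spaCy's real class)."""

    def __init__(self, obj, extensions):
        self._obj = obj
        self._extensions = extensions  # name -> (kind, function)

    def __getattr__(self, name):
        # Only called for names that are not regular attributes.
        try:
            kind, func = self._extensions[name]
        except KeyError:
            raise AttributeError(name)
        if kind == "getter":
            # Getter: call the function now and hand back its result.
            return func(self._obj)
        # Method: hand back a callable with the object already bound.
        return functools.partial(func, self._obj)


def halves(words):
    # Hypothetical "chunker": split the sequence into two halves.
    mid = len(words) // 2
    return [words[:mid], words[mid:]]


words = ["I", "love", "spaCy"]
under = Underscore(
    words,
    {"as_method": ("method", halves), "as_getter": ("getter", halves)},
)

bound = under.as_method   # nothing has been computed yet
print(type(bound))        # <class 'functools.partial'>
print(bound())            # [['I'], ['love', 'spaCy']]
print(under.as_getter)    # [['I'], ['love', 'spaCy']]
```

So with method= you get the list only after calling the attribute, while with getter= plain attribute access is enough.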

However, if you are planning to compute this attribute for every single document, I think you should make it part of your pipeline instead. Here is some simple sample code that does it both ways.

import spacy
from spacy.tokens import Doc

def chunk_getter(doc):
    # The getter is called every time we access doc._.extension_1,
    # so the computation happens at access time.
    # Because this is a getter, we return the result of the computation.
    first_half = doc[0:len(doc)//2]
    second_half = doc[len(doc)//2:len(doc)]

    return [first_half, second_half]

def write_chunks(doc):
    # This pipeline component runs as part of the spaCy pipeline,
    # so the computation happens once, when the text is processed.
    # Because this is a pipeline component, we store the result in an
    # extension attribute (which must be registered with set_extension)
    # and then return the doc itself.
    first_half = doc[0:len(doc)//2]
    second_half = doc[len(doc)//2:len(doc)]

    doc._.extension_2 = [first_half, second_half]

    return doc


nlp = spacy.load("en_core_web_sm", disable=["tagger", "parser", "ner"])

Doc.set_extension("extension_1", getter=chunk_getter)
Doc.set_extension("extension_2", default=[])

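# spaCy v2 style: the function is passed directly. In spaCy v3, register it
# with @Language.component and add it by its string name instead.
nlp.add_pipe(write_chunks)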

test_doc = nlp('I love spaCy')
print(test_doc._.extension_1)
print(test_doc._.extension_2)

This prints [I, love spaCy] twice, because these are two ways of doing the same thing. The getter recomputes the chunks on every access, whereas the pipeline component computes them once when the text is processed, so if you expect to need this output for every document you parse, making it part of your pipeline with nlp.add_pipe is the better approach.
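For the use case in the question (iterating over the tokens and collecting Spans when a rule matches), a pipeline component could look roughly like the sketch below. It assumes spaCy v3, where components are registered with @Language.component and added by name. The component name chunk_writer, the extension name chunks and the rule itself (keep runs of consecutive alphabetic tokens) are placeholders for illustration, not the asker's actual iQ_chunker logic.

```python
import spacy
from spacy.language import Language
from spacy.tokens import Doc

# force=True lets this snippet be re-run without an "already registered" error.
Doc.set_extension("chunks", default=[], force=True)


@Language.component("chunk_writer")
def chunk_writer(doc):
    # Placeholder rule: a chunk is a run of consecutive alphabetic tokens.
    spans = []
    start = None
    for token in doc:
        if token.is_alpha:
            if start is None:
                start = token.i  # a new run begins here
        elif start is not None:
            spans.append(doc[start:token.i])  # the run ended before this token
            start = None
    if start is not None:
        spans.append(doc[start:len(doc)])  # run that reaches the end of the doc
    doc._.chunks = spans
    return doc


# A blank English pipeline only tokenizes, so no model download is needed.
nlp = spacy.blank("en")
nlp.add_pipe("chunk_writer")

doc = nlp("Revenue grew strongly | 12 34 56 | Costs fell")
print([span.text for span in doc._.chunks])
```

Storing a list on the doc (rather than yielding from a generator) keeps the result available for repeated access after the pipeline has run.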