
spaCy custom component function is never called


I am adding a custom component to spaCy but it never gets called:

import spacy
from spacy.language import Language

@Language.component("custom_sentence_boundaries")
def custom_sentence_boundaries(doc):
    print(".")
    for token in doc[:-1]:
        if token.text == "\n":
            doc[token.i + 1].is_sent_start = True
    return doc

nlp = spacy.load("de_core_web_sm")
nlp.add_pipe("custom_sentence_boundaries", after="parser")
nlp.analyze_pipes(pretty=True)
doc = nlp(text)
sentences = [sent.text for sent in doc.sents]

I get a result in sentences, and the analyzer does list my component, but the custom component seems to have no effect and I never see the dots from the print statement.

Any ideas?


Solution

  • In the code you pasted, you load the model like this:

    nlp = spacy.load("de_core_web_sm")
    

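    However, de_core_web_sm is not a published spaCy model name (the small German model is called de_core_news_sm). For the English model, it should be: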

    nlp = spacy.load("en_core_web_sm")
    

    I reproduced your code with that change, and the component is called:

    import spacy
    from spacy.language import Language

    @Language.component("custom_sentence_boundaries")
    def custom_sentence_boundaries(doc):
        print("...$...")  # distinctive marker that is easy to spot in the output
        for token in doc[:-1]:
            if token.text == "\n":
                doc[token.i + 1].is_sent_start = True
        return doc
    
    nlp = spacy.load("en_core_web_sm")
    nlp.add_pipe("custom_sentence_boundaries", after="parser")
    nlp.analyze_pipes(pretty=True)
    text = ("When Sebastian Thrun started working on self-driving cars at "
            "Google in 2007, few people outside of the company took him "
            "seriously. “I can tell you very senior CEOs of major American "
            "car companies would shake my hand and turn away because I wasn’t "
            "worth talking to,” said Thrun, in an interview with Recode earlier "
            "this week.")
    doc = nlp(text)
    sentences = [sent.text for sent in doc.sents]
    

    Output

    (At the bottom, ...$... is printed, and custom_sentence_boundaries is listed directly after parser because we passed after="parser".)

    ============================= Pipeline Overview =============================
    
    #   Component                    Assigns               Requires   Scores             Retokenizes
    -   --------------------------   -------------------   --------   ----------------   -----------
    0   tok2vec                      doc.tensor                                          False      
                                                                                                    
    1   tagger                       token.tag                        tag_acc            False      
                                                                                                    
    2   parser                       token.dep                        dep_uas            False      
                                     token.head                       dep_las                       
                                     token.is_sent_start              dep_las_per_type              
                                     doc.sents                        sents_p                       
                                                                      sents_r                       
                                                                      sents_f                       
                                                                                                    
    3   custom_sentence_boundaries                                                       False      
                                                                                                    
    4   attribute_ruler                                                                  False      
                                                                                                    
    5   lemmatizer                   token.lemma                      lemma_acc          False      
                                                                                                    
    6   ner                          doc.ents                         ents_f             False      
                                     token.ent_iob                    ents_p                        
                                     token.ent_type                   ents_r                        
                                                                      ents_per_type                 
    
    ✔ No problems found.
    ...$...
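
To check whether a custom component is invoked at all without depending on a particular trained model, a blank pipeline may be enough. The following is a minimal sketch, assuming spaCy 3.x is installed. It uses spacy.blank("en") instead of the models discussed above, so there is no parser and the component is simply added at the end of the pipeline. The calls list and the sample text are hypothetical helpers introduced only for this illustration.

```python
import spacy
from spacy.language import Language

# Hypothetical helper (not part of spaCy): records each call of the component.
calls = []


@Language.component("custom_sentence_boundaries")
def custom_sentence_boundaries(doc):
    calls.append(len(doc))  # remember that we were called, and on how many tokens
    for token in doc[:-1]:
        if token.text == "\n":
            # Start a new sentence right after every newline token.
            doc[token.i + 1].is_sent_start = True
    return doc


# Assumption: a blank English pipeline (tokenizer only) stands in for a trained model.
nlp = spacy.blank("en")
nlp.add_pipe("custom_sentence_boundaries")

# Sample text with a newline, so the component has a boundary to set.
doc = nlp("First line\nSecond line")
sentences = [sent.text for sent in doc.sents]

print("component calls:", len(calls))
print("pipeline:", nlp.pipe_names)
print("sentences:", sentences)
```

If the call count stays at zero in a setup like this, the problem is likely in how the pipeline object was created or which nlp object is being called, rather than in the component body itself.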