I am using Python 3.9.7 with the spaCy library, and I want to change the way the model segments a given sentence. Here is an example sentence together with the segmentation rule I created:
import spacy

nlp = spacy.load('en_core_web_sm')
doc2 = nlp(u'"Management is doing the right things; leadership is doing the right things." -Peter Drucker')

def set_custom_boundaries(doc):
    for token in doc[:-1]:
        if token.text == ";":
            doc[token.i + 1].is_sent_start = True
    return doc

nlp.add_pipe(set_custom_boundaries, before='parser')
However, this produces the error message below:
ValueError Traceback (most recent call last)
C:\Users\SEYDOU~1\AppData\Local\Temp/ipykernel_21000/1705623728.py in <module>
----> 1 nlp.add_pipe(set_custom_boundaries, before='parser')
~\Anaconda3\lib\site-packages\spacy\language.py in add_pipe(self, factory_name, name, before, after, first, last, source, config, raw_config, validate)
777 bad_val = repr(factory_name)
778 err = Errors.E966.format(component=bad_val, name=name)
--> 779 raise ValueError(err)
780 name = name if name is not None else factory_name
781 if name in self.component_names:
ValueError: [E966] `nlp.add_pipe` now takes the string name of the registered component factory, not a callable component. Expected string, but got <function set_custom_boundaries at 0x000002520A59CCA0> (name: 'None').
- If you created your component with `nlp.create_pipe('name')`: remove nlp.create_pipe and call `nlp.add_pipe('name')` instead.
- If you passed in a component like `TextCategorizer()`: call `nlp.add_pipe` with the string name instead, e.g. `nlp.add_pipe('textcat')`.
- If you're using a custom component: Add the decorator `@Language.component` (for function components) or `@Language.factory` (for class components / factories) to your custom component and assign it a name, e.g. `@Language.component('your_name')`. You can then run `nlp.add_pipe('your_name')` to add it to the pipeline.
I looked at some solutions online, but as a beginner in Python I could not solve the problem. How do I use my own custom segmentation rule in the spaCy pipeline?
The syntax of `nlp.add_pipe` with a custom function is documented here. You must (1) register the component function with the `@Language.component` decorator and (2) pass the component's name to `nlp.add_pipe` as a string. So it should be something like this:
from spacy.language import Language

@Language.component("set_custom_boundaries")
def set_custom_boundaries(doc):
    for token in doc[:-1]:
        if token.text == ";":
            doc[token.i + 1].is_sent_start = True
    return doc

nlp.add_pipe("set_custom_boundaries", before='parser')
Note: your function implements an unusual segmentation rule and won't work in general. For example, it does nothing for sentences that end with '.', '...', '!', etc.
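For reference, here is a minimal, self-contained sketch putting the pieces together. To keep it runnable without downloading a model, it uses `spacy.blank("en")` (a tokenizer-only pipeline) instead of `en_core_web_sm`; with the full model you would pass `before="parser"` exactly as above. The component name `semicolon_boundaries` is just an illustrative choice:

```python
import spacy
from spacy.language import Language

# In spaCy 3.x, custom components must be registered under a string name.
@Language.component("semicolon_boundaries")
def semicolon_boundaries(doc):
    # Mark the token following each ";" as the start of a new sentence.
    for token in doc[:-1]:
        if token.text == ";":
            doc[token.i + 1].is_sent_start = True
    return doc

# A blank English pipeline needs no model download; with en_core_web_sm
# you would call nlp.add_pipe("semicolon_boundaries", before="parser").
nlp = spacy.blank("en")
nlp.add_pipe("semicolon_boundaries")

doc = nlp('"Management is doing the right things; '
          'leadership is doing the right things." -Peter Drucker')
for sent in doc.sents:
    print(sent.text)
```

Because the component is added by name, `nlp.pipe_names` will now include `"semicolon_boundaries"`, and `doc.sents` splits the quote into two sentences at the semicolon.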