When text is tagged by SpaCy, the vertical bar "|" is assigned different POS tags depending on the context, such as "ADV" , "DEL"... While I want "|" to be recognized as "PUNC". Is there a way to force this POS for "|" ?
I tried this command and it didn't work.
nlp.tokenizer.add_special_case('|', [{ORTH: '|', POS: PUNC}])
I would add a simple pipe into the pipeline, right after the tagger
:
def pos_postprocessor_pipe(doc) :
for token in doc :
if token.text == '|':
token.pos_ = 'PUNCT'
return doc
nlp = spacy.load("en_core_web_sm")
nlp.add_pipe(pos_postprocessor_pipe, name="pos_postprocessor", after='tagger')