python spacy wikipedia wikidata entity-linking

Can spaCy link only named entities?


Here's an excerpt from a (supposedly) funny review of a restaurant:

I'd like to personally shake Mr Tofu's hand. While I cannot medically prove it, I am 100% certain that their soondubu contains undefined healing properties. Some how some way, I always feel better after a meal here. Got a cold? Screw the Nyquil and get the spicy kimchi soondubu.

I'd like to extract important entities and link them to Wikipedia entities. I've trained spaCy on a small sample of Wikipedia/Wikidata and ran entity linking on the review:

[('Tofu', 'PERSON', 'Q177378'), 
('Nyquil', 'WORK_OF_ART', 'NIL')]

I'd like other entities to be extracted and linked as well, e.g.:

kimchi -> Kimchi
cold -> Common cold
healing -> medicine 
medically -> medicine

It looks like spaCy can link only named entities. I've tried to explicitly list other entities as named (which obviously does not scale well):

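from spacy.pipeline import EntityRuler

# nlp is the pipeline that already contains the trained entity linker
ruler = EntityRuler(nlp)
patterns = [{"label": "ORG", "pattern": "kimchi"}, {"label": "ORG", "pattern": "cold"}]
ruler.add_patterns(patterns)
nlp.add_pipe(ruler)  # spaCy v2 API: appended to the end of the pipeline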

However, spaCy does not seem to link new entities at all:

[ ('Tofu', 'PERSON', 'Q177378'),
  ('cold', 'ORG', ''),
  ('Nyquil', 'WORK_OF_ART', 'NIL'),
  ('kimchi', 'ORG', '')]

  1. How can I make spaCy also recognize other entities?
  2. Should this be done before training the entity linking model, or can it be done with an already trained model?
  3. Is spaCy the right tool for my task at all?

Solution

  • In theory it's possible. First, you'll need to make sure you have a component that tags these kinds of entities. You could train an NER model for this, but be aware that its performance on things like "cold" might not be as good as it would be for actual named entities like "London".

    The example scripts that create the knowledge base (KB) and the entity linker (EL) from Wikipedia/Wikidata are not limited to named entities: they attempt to parse anything that appears in an intra-wiki link. If the word "cold" gets linked to the page "Common cold", the model should be able to learn it. The exact entities that are stored in the KB and used for training the EL model depend on which entities your entity recognizer component finds. So if you adjust that component to your use case, the entity linking component will follow automatically.
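To illustrate why a common noun such as "cold" can become a linkable alias, here is a minimal standard-library sketch of the intra-wiki link idea: count how often each anchor text points to each page and turn the counts into prior probabilities. This is not the code of spaCy's example scripts. The function name `alias_priors`, the regular expression, and the sample markup are all assumptions made up for illustration.

```python
import re
from collections import Counter, defaultdict

# Matches [[Target]] and [[Target|anchor text]] intra-wiki links.
WIKI_LINK = re.compile(r"\[\[([^\[\]|]+)(?:\|([^\[\]]+))?\]\]")


def alias_priors(wiki_text):
    """Map each lowercased anchor text to {target page: prior probability}."""
    counts = defaultdict(Counter)
    for target, anchor in WIKI_LINK.findall(wiki_text):
        # A link without a pipe uses the page title itself as anchor text.
        alias = (anchor or target).strip().lower()
        counts[alias][target.strip()] += 1
    priors = {}
    for alias, targets in counts.items():
        total = sum(targets.values())
        priors[alias] = {page: n / total for page, n in targets.items()}
    return priors


# Made-up sample markup, for illustration only.
sample = (
    "A [[Common cold|cold]] is a viral infection. "
    "[[Kimchi]] is a Korean side dish. "
    "Many people catch a [[Common cold|cold]] in winter, "
    "when [[Cold|cold]] weather sets in."
)

priors = alias_priors(sample)
for alias in sorted(priors):
    print(alias, priors[alias])
```

In this toy sample the alias "cold" gets two candidate pages, with "Common cold" the more likely one, even though "cold" is not a named entity. A real pipeline would still need a recognizer that marks "cold" as a span before the linker can pick among such candidates.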