python artificial-intelligence spacy

How to build an extractor pipeline with spaCy


I am currently trying to extract some information from sentences with spaCy. I have taken a few courses on it, but it is still a bit unclear to me.

I have the following sentence: ZZZ LLC is a limited liability company formed in the UK. XYZ LLC is a limited liability company formed in the UK. XYZ LLC owns a commercial property located in Germany known as ‘rentview’. Mr X owns 21% of XYZ LLC, the remaining 79% are own by ZZZ LLC which is the sole director of XYZ LLC.

What I want to extract are the following:

{"name": "XYZ LLC", "type": "ORG", "country": "UK"},
{"name": "ZZZ LLC", "type": "ORG", "country": "UK"},
{"name": "XYZ LLC", "type": "ORG", 
  "owns": [{"name": "rentview", "type": "commercial property", "country": "Germany"}], 
  "owned_by": [
    {"name": "X", "type": "PERSON", "percent": 21},
    {"name":"ZZZ LLC", "type": "ORG", "percent": 79}
  ]
}

My approach is to first assign an incorporation country to each company, then detect what each company owns, then detect who each company is owned by, and finally generate the JSON object.
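As a minimal sketch of the last step only, the facts could be merged into records with plain Python once the earlier steps have produced them. The function name `build_records` and the tuple layouts are assumptions made for illustration, not part of spaCy. The sketch also keeps one record per company, which differs slightly from the target output above, where XYZ LLC appears twice.

```python
import json


def build_records(countries, assets, ownerships):
    """Merge extracted facts into one record per company.

    countries:  {company: country}
    assets:     [(company, asset_name, asset_type, country)]
    ownerships: [(owner, owner_type, company, percent)]
    """
    records = {}

    def record(name):
        # Create the company record on first use, then reuse it.
        return records.setdefault(name, {"name": name, "type": "ORG"})

    for company, country in countries.items():
        record(company)["country"] = country
    for company, asset, asset_type, country in assets:
        record(company).setdefault("owns", []).append(
            {"name": asset, "type": asset_type, "country": country})
    for owner, owner_type, company, percent in ownerships:
        record(company).setdefault("owned_by", []).append(
            {"name": owner, "type": owner_type, "percent": percent})
    return list(records.values())


# Hand-written facts standing in for the output of the extraction steps.
records = build_records(
    countries={"ZZZ LLC": "UK", "XYZ LLC": "UK"},
    assets=[("XYZ LLC", "rentview", "commercial property", "Germany")],
    ownerships=[("X", "PERSON", "XYZ LLC", 21),
                ("ZZZ LLC", "ORG", "XYZ LLC", 79)],
)
print(json.dumps(records, indent=2))
```

Keeping the assembly separate from the spaCy components means each extraction step can be tested on its own before the JSON is generated.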

But I am already stuck on assigning a country to a company.

My code for this is the following, but I am pretty sure I'm not using the right approach.

from spacy.language import Language
from spacy.tokens import Span

Span.set_extension("incorporation_country", default=False)

@Language.component("assign_org_country")
def assign_org_country(doc):
  org_entities = [ent for ent in doc.ents if ent.label_ == "ORG"]
  for ent in org_entities:
    head = ent.root.head
    if head.lemma_ in ['be']:
      for child in head.children:
        if child.dep_ == "attr" and child.text == "company" and child.right_edge.ent_type_ == "GPE":
          ent._.incorporation_country = child
          print(f"country of {ent.text} is {ent._.incorporation_country}")
  return doc

Any ideas or tips on how to achieve this?


Solution

  • One approach is to use spaCy's built-in EntityRuler to add custom rules that label company and country mentions, which a later step can then link together.

    Here is an example using EntityRuler:

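    import spacy
    from spacy.tokens import Span

    # Assumes the en_core_web_sm model is installed; any pipeline with an "ner" component works.
    nlp = spacy.load("en_core_web_sm")

    # Register the custom attribute read in the loop below.
    Span.set_extension("incorporation_country", default=None, force=True)

    # Each dict matches one token, so multi-word names cannot go in the "IN" list.
    custom_patterns = [
        {"label": "ORG", "pattern": [
            {"LOWER": {"IN": ["company", "corporation", "incorporated", "llc"]}},
            {"IS_PUNCT": True},
            {"ENT_TYPE": "GPE"}
        ]}
    ]

    # spaCy v3: add the ruler by name. It runs after "ner" because the pattern uses ENT_TYPE.
    ruler = nlp.add_pipe("entity_ruler", after="ner", config={"overwrite_ents": True})
    ruler.add_patterns(custom_patterns)

    doc = nlp("ZZZ LLC is a limited liability company formed in the UK.")

    for ent in doc.ents:
        print(f"country of {ent.text} is {ent._.incorporation_country}")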
    

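    The rule labels each span that matches the pattern (a company keyword, a punctuation mark, then a GPE token); you can customize the pattern as you wish. Note that EntityRuler only assigns entity labels, so incorporation_country keeps its default value until a separate component sets it.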
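To actually fill in the country, a small custom component can be combined with a Matcher. The following is a rough sketch under stated assumptions, not a definitive implementation: it uses a blank English pipeline with hand-written ruler patterns standing in for a trained NER, the trigger phrase "formed in [the] <GPE>" is assumed from the example text, and the component name `link_org_country` is made up for illustration.

```python
import spacy
from spacy.language import Language
from spacy.matcher import Matcher
from spacy.tokens import Span

Span.set_extension("incorporation_country", default=None, force=True)

# A blank pipeline keeps the sketch model-free; the ruler stands in for a trained NER.
nlp = spacy.blank("en")
nlp.add_pipe("sentencizer")
ruler = nlp.add_pipe("entity_ruler")
ruler.add_patterns([
    {"label": "ORG", "pattern": [{"ORTH": "ZZZ"}, {"ORTH": "LLC"}]},
    {"label": "ORG", "pattern": [{"ORTH": "XYZ"}, {"ORTH": "LLC"}]},
    {"label": "GPE", "pattern": [{"ORTH": "UK"}]},
])

# Assumed trigger phrase: "formed in [the] <GPE>".
matcher = Matcher(nlp.vocab)
matcher.add("FORMED_IN", [[
    {"LOWER": "formed"},
    {"LOWER": "in"},
    {"LOWER": "the", "OP": "?"},
    {"ENT_TYPE": "GPE"},
]])


@Language.component("link_org_country")  # hypothetical component name
def link_org_country(doc):
    for _, start, end in matcher(doc):
        country = doc[end - 1]
        # Attach the country to the first ORG before the trigger in the same sentence.
        for ent in doc.ents:
            same_sentence = ent.sent.start == country.sent.start
            if ent.label_ == "ORG" and same_sentence and ent.end <= start:
                ent._.incorporation_country = country.text
                break
    return doc


nlp.add_pipe("link_org_country", last=True)

doc = nlp("ZZZ LLC is a limited liability company formed in the UK. "
          "XYZ LLC is a limited liability company formed in the UK.")
countries = {ent.text: ent._.incorporation_country
             for ent in doc.ents if ent.label_ == "ORG"}
print(countries)
```

On the two example sentences this should map both companies to UK. With a trained pipeline, the ruler patterns can be dropped and the same component added after "ner", although the rule will only fire where the model tags the country as GPE.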