I am trying to extract some text from sentences with spaCy. I took some courses on it, but it is still a bit blurry to me.
I have the following sentence: ZZZ LLC is a limited liability company formed in the UK. XYZ LLC is a limited liability company formed in the UK. XYZ LLC owns a commercial property located in Germany known as ‘rentview’. Mr X owns 21% of XYZ LLC, the remaining 79% are own by ZZZ LLC which is the sole director of XYZ LLC.
What I want to extract is the following:
{"name": "XYZ LLC", "type": "ORG", "country": "UK"},
{"name": "ZZZ LLC", "type": "ORG", "country": "UK"},
{"name": "XYZ LLC", "type": "ORG",
 "owns": [{"name": "rentview", "type": "commercial property", "country": "Germany"}],
 "owned_by": [
    {"name": "X", "type": "PERSON", "percent": 21},
    {"name": "ZZZ LLC", "type": "ORG", "percent": 79}
 ]
}
My approach is to first assign an incorporation country to each company, then detect what each company owns (owns), then detect who owns each company (owned_by), and finally generate the JSON object.
But I am already stuck on assigning a country to a company.
My code for this step is below, but I am fairly sure I am not using the right approach.
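from spacy.language import Language
from spacy.tokens import Span

# Custom attribute that holds the country an ORG entity was formed in
Span.set_extension("incorporation_country", default=None)

@Language.component("assign_org_country")
def assign_org_country(doc):
    org_entities = [ent for ent in doc.ents if ent.label_ == "ORG"]
    for ent in org_entities:
        head = ent.root.head  # relies on the dependency parse
        if head.lemma_ == "be":
            for child in head.children:
                # right_edge is the last token of the "company ..." subtree, e.g. "UK"
                if child.dep_ == "attr" and child.text == "company" and child.right_edge.ent_type_ == "GPE":
                    ent._.incorporation_country = child.right_edge  # the country token, not "company"
                    print(f"country of {ent.text} is {ent._.incorporation_country}")
    return doc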
Any ideas or tips on how to achieve this?
One way to approach the problem is spaCy's built-in EntityRuler, which lets you add token-pattern rules to the pipeline so that spans around company and country mentions are detected automatically.
Here is an example using EntityRuler:
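import spacy
from spacy.tokens import Span

nlp = spacy.load("en_core_web_sm")  # assumed model; any pipeline with an "ner" component works
# Reuse the extension from the question; register it here only if it is missing
if not Span.has_extension("incorporation_country"):
    Span.set_extension("incorporation_country", default=None)

custom_patterns = [
    {"label": "ORG", "pattern": [
        # LOWER is tested against one token at a time, so multi-word names need one dict per token
        {"LOWER": {"IN": ["company", "corporation", "incorporated", "llc"]}},
        {"IS_PUNCT": True},
        {"ENT_TYPE": "GPE"},
    ]}
]
# spaCy v3: add the ruler by its registered name, after "ner" so that ENT_TYPE is already set
ruler = nlp.add_pipe("entity_ruler", after="ner", config={"overwrite_ents": True})
ruler.add_patterns(custom_patterns)

doc = nlp("ZZZ LLC is a limited liability company formed in the UK.")
for ent in doc.ents:
    # The ruler only labels spans; incorporation_country stays None unless another component sets it
    print(f"country of {ent.text} is {ent._.incorporation_country}")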
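Every span matched by this rule is labeled ORG and appears in doc.ents. The ruler does not fill incorporation_country by itself, so that attribute still has to be set by a custom component such as assign_org_country from the question. You can customize the patterns as needed.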
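To show how the ruler and a custom component can work together, here is a minimal sketch rather than a definitive solution. It assumes spaCy v3 and uses a blank English pipeline, so no trained model is needed. The component name assign_country_by_sentence, the two ruler patterns, and the "formed in" heuristic are all assumptions made for illustration, and the input text is a shortened version of the one in the question.

```python
import json

import spacy
from spacy.language import Language
from spacy.tokens import Span

# Extension name reused from the question; force=True allows re-running this script.
Span.set_extension("incorporation_country", default=None, force=True)


@Language.component("assign_country_by_sentence")  # hypothetical component name
def assign_country_by_sentence(doc):
    """Attach a GPE preceded by 'formed in (the)' to the first ORG of the same sentence."""
    for sent in doc.sents:
        orgs = [ent for ent in sent.ents if ent.label_ == "ORG"]
        if not orgs:
            continue
        for ent in sent.ents:
            if ent.label_ != "GPE":
                continue
            # Look at up to three tokens before the GPE and ignore an optional "the".
            start = max(sent.start, ent.start - 3)
            before = [t.lower_ for t in doc[start:ent.start] if t.lower_ != "the"]
            if before[-2:] == ["formed", "in"]:
                orgs[0]._.incorporation_country = ent.text
    return doc


def build_pipeline():
    nlp = spacy.blank("en")       # tokenizer only, no statistical model
    nlp.add_pipe("sentencizer")   # rule-based sentence boundaries
    ruler = nlp.add_pipe("entity_ruler")
    ruler.add_patterns([
        # Illustrative patterns: an all-caps token followed by "LLC", and two country names.
        {"label": "ORG", "pattern": [{"IS_UPPER": True}, {"TEXT": "LLC"}]},
        {"label": "GPE", "pattern": [{"LOWER": {"IN": ["uk", "germany"]}}]},
    ])
    nlp.add_pipe("assign_country_by_sentence")
    return nlp


def extract_countries(text, nlp):
    """Return one record per ORG mention that has an incorporation country."""
    doc = nlp(text)
    return [
        {"name": ent.text, "type": ent.label_, "country": ent._.incorporation_country}
        for ent in doc.ents
        if ent.label_ == "ORG" and ent._.incorporation_country is not None
    ]


TEXT = (
    "ZZZ LLC is a limited liability company formed in the UK. "
    "XYZ LLC is a limited liability company formed in the UK. "
    "XYZ LLC owns a commercial property located in Germany."
)
nlp = build_pipeline()
records = extract_countries(TEXT, nlp)
print(json.dumps(records, indent=2))
```

With a trained pipeline you would drop the hand-written ORG and GPE patterns and rely on the model's ner component instead, keeping only the component that links the two. The "formed in" check is deliberately narrow, so "located in Germany" in the third sentence is not treated as an incorporation country.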