python-3.x nlp named-entity-recognition

Mapping entities between two disparate company datasets


I have several datasets containing data about companies:
- entity_structure (columns: entity_id, parent_entity_id, ultimate_parent_id)
- entity_addresses (columns: address_id, entity_id, location_city, state, postal_code, zip, street, ...)
- vendor (columns: vendor_id, parent_vendor_id, top_vendor_id, cnt_children, orgtype_id, geo_id, name, email, ...)
- geo (columns: geo_id, zipcode, is_primary, latitude, longtitude, elevation, state, ...)
- entity_coverage (columns: entity_id, name, proper_name, sic_code, industry_code, sector_code, iso, ...)

I need to automatically map entities between the datasets. For example, there may be a company named "Google" in one dataset and a company named "Google 123" in another, and I need to determine with high enough confidence that they are the same entity. In most cases, the data does not share a unique key.

Would named entity linking be the best approach here? Are there any Python examples of how to approach this problem?


Solution

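  • Based on your example, Levenshtein distance (edit distance) may help. It counts the single-character insertions, deletions, and substitutions needed to turn one string into another, so near-identical names such as "Google" and "Google 123" score as close.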
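
As a rough illustration of the Levenshtein idea, here is a minimal pure-Python sketch. The function names (levenshtein, similarity, best_match), the lowercasing step, and the 0.5 threshold are assumptions made for this example, not something taken from your data; a real pipeline would likely tune the threshold and also compare other columns such as addresses or postal codes.

```python
def levenshtein(a: str, b: str) -> int:
    """Number of single-character edits needed to turn a into b."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(
                previous[j] + 1,          # deletion
                current[j - 1] + 1,       # insertion
                previous[j - 1] + cost,   # substitution (or match)
            ))
        previous = current
    return previous[-1]


def similarity(a: str, b: str) -> float:
    """Edit distance rescaled to 0..1, where 1.0 means identical.

    Lowercasing and stripping whitespace is an assumed normalization.
    """
    a, b = a.strip().lower(), b.strip().lower()
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    return 1.0 - levenshtein(a, b) / longest


def best_match(name, candidates, threshold=0.5):
    """Return (candidate, score) for the closest name, or None.

    The threshold is a hypothetical cut-off; pick one by inspecting
    known matches in your own data.
    """
    if not candidates:
        return None
    best = max(candidates, key=lambda c: similarity(name, c))
    score = similarity(name, best)
    return (best, score) if score >= threshold else None


if __name__ == "__main__":
    vendor_names = ["Alphabet", "Google", "Microsoft"]
    print(best_match("Google 123", vendor_names))
```

This compares every name against every candidate, which is fine for small tables but grows quadratically. For larger datasets you would probably first narrow the candidates (for example, by matching on state or postal code) and only then score the names.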