I am reading this paper, and in sub-section 3.2.1, the last three lines of the first paragraph say:
"To map the named entity candidates to the standard attribute names, we employed the k-means algorithm to cluster the identified named entities by computing the cosine similarities between them based on Term Frequency–Inverse Document Frequency (TFIDF)."
Can anyone explain what that means? If possible, give an example of an implementation scenario.
I am not completely sure what they mean; the best way to find out is to ask the paper's authors directly. But it seems that the clustering was performed to do something related to entity linking.
Entity linking is the process of disambiguating the named entities discovered in a text by matching them with unique identities (e.g. Wikipedia articles or database entries). For example, "Washington" can be linked to the city "Washington, D.C.", the state "Washington", or the person "George Washington". On the other hand, the strings "Stanford", "Stanford University", "Leland Stanford Junior University", "LSJU", "Stanford U.", "Stanford uni", "University of Stanford", "Stanford.edu", "Stanfurd", and a few more all refer to the same institution. This information is not provided by pure NER models: given "I graduated from Stanford U. in 2010", they can only tell you that "Stanford U." is a school, but not that it is some specific school.
You may want to use NEL because an NER model predicts only that "Stanford U" is the name of an educational institution, or that "TeslaMotors" is the name of a company. The NEL model then predicts that "Stanford U" really means "Stanford University", and "TeslaMotors" really means "Tesla, Inc.". So you can think of named entity linking as "refining" the recognized entities. This is useful, for example, if you perform some downstream task (e.g. classification of resumes) using the found entities, and "Tesla, Inc." is present in the training sample whereas "TeslaMotors" isn't. In this situation, named entity linking will improve the generalization ability of the downstream model, because after NEL both entities are treated exactly the same way.
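To make the "refinement" idea concrete, here is a toy illustration (not from the paper): once several surface forms are linked to the same canonical name, the downstream model only ever sees the canonical form. The entity strings and the mapping below are made up for illustration.

```python
# Hypothetical mapping from raw NER output to canonical entity names.
entity_links = {
    "Stanford U": "Stanford University",
    "TeslaMotors": "Tesla, Inc.",
    "Tesla Motors": "Tesla, Inc.",
}

def normalize(entity: str) -> str:
    # Fall back to the raw string if we have no link for it.
    return entity_links.get(entity, entity)

print(normalize("TeslaMotors"))  # -> "Tesla, Inc."
print(normalize("Stanford U"))   # -> "Stanford University"
```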
The authors of the paper, however, don't seem to have a database of all their domain-specific entities (schools, degrees, skills, job positions, etc.), or a labeled dataset to train a model for entity linking. Therefore, instead of classical entity linking, they simply merge similar occurrences of entities into clusters, hoping that the strings that end up in the same cluster really do refer to the same identity. A sketch of how this could be implemented is given below.
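Here is a minimal sketch of that step with scikit-learn, based on my reading of the quoted sentence. The entity strings, the character n-gram range, and the number of clusters are all my own hypothetical choices; the paper does not specify them (the authors may well have used word-level TF-IDF and a tuned cluster count).

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

# Named entity candidates extracted by some NER step (made-up examples).
entities = [
    "Stanford University", "Stanford U.", "University of Stanford",
    "Tesla, Inc.", "TeslaMotors", "Tesla Motors",
    "Washington, D.C.", "George Washington University",
]

# Character n-grams handle spelling variants of short strings better than
# word tokens; this is my assumption, not necessarily what the authors did.
vectorizer = TfidfVectorizer(analyzer="char_wb", ngram_range=(2, 4))
X = vectorizer.fit_transform(entities)  # rows are L2-normalized by default

# On L2-normalized vectors, Euclidean k-means behaves like clustering by
# cosine similarity, since ||a - b||^2 = 2 - 2 * cos(a, b).
# The number of clusters is a key hyperparameter; 4 is hand-picked here.
kmeans = KMeans(n_clusters=4, n_init=10, random_state=0)
labels = kmeans.fit_predict(X)

# Group the surface forms by cluster; ideally each cluster corresponds to
# one real-world entity and can be mapped to one standard attribute name.
clusters = {}
for name, label in zip(entities, labels):
    clusters.setdefault(label, []).append(name)

for label, members in clusters.items():
    print(label, members)
```

Each resulting cluster could then be mapped to a standard attribute name, e.g. by picking the most frequent member as the canonical form or by labelling the clusters manually.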
This approach may seem crude, but it is better than no linking at all, and it can provide a good starting point for manually labelling/linking the clusters and thus creating a dataset for training a supervised entity-linking model.