python nlp spacy named-entity-recognition

How to normalise keywords extracted with Named Entity Recognition


I'm trying to use NER to extract keywords (tags) from job postings. These can be anything along the lines of React, AWS, Team Building, or Marketing.

After training a custom model in spaCy, I ran into a problem: the extracted tags are not unified/normalised across the data.

For example, if a job posting is about frontend development, NER can extract the keyword frontend in many forms (depending on the job description): Frontend, Front End, Front-End, front-end and so on.

Is there a reliable way to normalise/unify the extracted keywords? All the keywords go directly into the database and, with all the variants of each keyword, I would end up with too much noise.

One way to tackle the problem would be to create mappings such as:

"Frontend": ["Front End", "Front-End", "front-end"]

but that approach doesn't seem very smart. Perhaps spaCy itself has an option to normalise tags?


Solution

  • These simple rules, applied to a string s, can quickly collapse similar strings:

    • s.lower()
    • s.replace("-", " ")
    • s.replace(" ", "")
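
Chained together, the three rules above might look like the sketch below. The function names `normalise_key` and `canonical_forms` are made up for illustration. Picking the most frequent surface form as the display label is an extra assumption on my part, not something spaCy does for you.

```python
from collections import Counter, defaultdict


def normalise_key(s: str) -> str:
    """Apply the three rules in order: lowercase, hyphen to space, drop spaces."""
    s = s.lower()
    s = s.replace("-", " ")
    s = s.replace(" ", "")
    return s


def canonical_forms(tags):
    """Group tags by normalised key and keep the most frequent spelling of each.

    Assumption: the most common surface form is a reasonable display label.
    """
    groups = defaultdict(Counter)
    for tag in tags:
        groups[normalise_key(tag)][tag] += 1
    return {key: counts.most_common(1)[0][0] for key, counts in groups.items()}


tags = ["Frontend", "Front End", "Front-End", "front-end", "Frontend",
        "AWS", "aws", "AWS"]
print(canonical_forms(tags))  # {'frontend': 'Frontend', 'aws': 'AWS'}
```

Storing the normalised key in the database and keeping the display form separately avoids labels like "teambuilding" showing up for multi-word tags.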

    Several phonetic algorithms, such as Metaphone, are good at collapsing "sounds alike" variants into a single base entity.
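
To give a feel for what a phonetic key does, here is a small pure-Python sketch of Soundex. Note that this is Soundex, a simpler and cruder relative of Metaphone, chosen only because it fits in a few lines; in practice you would more likely reach for a library implementation of Metaphone. The function name `soundex` and the example words are illustrative.

```python
import string

# Soundex digit for each consonant; vowels, h, w and y get no digit.
_CODES = {}
for _letters, _digit in (("bfpv", "1"), ("cgjkqsxz", "2"), ("dt", "3"),
                         ("l", "4"), ("mn", "5"), ("r", "6")):
    for _ch in _letters:
        _CODES[_ch] = _digit


def soundex(word: str) -> str:
    """Return the 4-character Soundex key of `word` (ASCII letters only)."""
    letters = [c for c in word.lower() if c in string.ascii_lowercase]
    if not letters:
        return ""
    encoded = letters[0].upper()
    prev = _CODES.get(letters[0], "")
    for ch in letters[1:]:
        if ch in "hw":
            continue  # h and w do not separate consonants with the same code
        code = _CODES.get(ch, "")
        if code and code != prev:
            encoded += code
        prev = code  # vowels reset prev, so a repeated code after a vowel counts
    return (encoded + "000")[:4]


print(soundex("Postgres"), soundex("Postgress"))  # P232 P232
```

Phonetic keys are aggressive: unrelated words can share a key, so it is probably safer to use them for suggesting merge candidates than for merging automatically.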

    A frequency analysis of bigrams may help you identify common two-word phrases that denote a single entity.
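
A bigram frequency count can be sketched with nothing but the standard library. The tokenisation here (lowercase, split on anything that is not a letter or digit) is a simplifying assumption, and `frequent_bigrams`, `min_count` and the sample postings are made up for illustration.

```python
import re
from collections import Counter


def frequent_bigrams(texts, min_count=2):
    """Count adjacent word pairs across all texts and keep the frequent ones."""
    counts = Counter()
    for text in texts:
        # Crude tokeniser: lowercase alphanumeric runs, so "front-end"
        # becomes ["front", "end"].
        tokens = re.findall(r"[a-z0-9]+", text.lower())
        counts.update(zip(tokens, tokens[1:]))
    return [(pair, n) for pair, n in counts.most_common() if n >= min_count]


postings = [
    "Senior front end developer",
    "Front end engineer with AWS",
    "We need a front-end developer",
]
print(frequent_bigrams(postings))
# [(('front', 'end'), 3), (('end', 'developer'), 2)]
```

Pairs that show up often, like ("front", "end") here, are candidates for being treated as one tag; on real data you would want to filter out stop-word pairs first.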

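    spaCy's token.lemma_ (compared with the raw token.text) can help by reducing inflected words to a base form; strictly speaking this is lemmatization rather than stemming, but it serves a similar purpose.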

    Learning that e.g. "React" and "Frontend" are more or less synonyms in this context would require a heavier-weight approach, such as word2vec, WordNet, or an LLM like ChatGPT.