I'm trying to employ NER to extract keywords (tags) from job postings. These can be anything like React, AWS, Team Building, or Marketing.
After training a custom model in spaCy I'm presented with a problem: the extracted tags are not unified/normalized across all of the data.
For example, if a job posting is about frontend development, NER can extract the keyword frontend in many ways (depending on the job description), for example: Frontend, Front End, Front-End, front-end, and so on.
Is there a reliable way to normalise/unify the extracted keywords? All the keywords go directly into the database and, with all the variants of each keyword, I would end up with too much noise.
One way to tackle the problem would be to create mappings such as:
"Frontend": ["Front End", "Front-End", "front-end"]
but maintaining such mappings by hand doesn't seem like a very scalable approach. Perhaps spaCy itself has an option to normalise tags?
Certainly, a few simple rules can quickly collapse similar strings, for example gathered into a small helper:

def normalize(s: str) -> str:
    s = s.lower()            # case-fold: "Frontend" -> "frontend"
    s = s.replace("-", " ")  # split hyphenated forms: "front-end" -> "front end"
    s = s.replace(" ", "")   # drop spaces: "front end" -> "frontend"
    return s
There are several phonetic algorithms such as Metaphone, that are good at collapsing "sounds alike" variants into a single base entity.
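As a minimal sketch, using the third-party jellyfish package (an assumption on my part; any Metaphone implementation will do):

import jellyfish  # assumption: pip install jellyfish

for v in ("Frontend", "Front End", "front-end"):
    # sound-alike variants should collapse to the same phonetic code
    print(v, "->", jellyfish.metaphone(v))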
A frequent bi-gram analysis may help you to identify common two-word phrases that denote a single entity.
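A minimal sketch with collections.Counter, where postings stands in for your collection of posting texts (a hypothetical name):

from collections import Counter

def bigrams(text):
    words = text.lower().split()
    return zip(words, words[1:])

counts = Counter()
for posting in postings:  # `postings`: hypothetical iterable of job-posting strings
    counts.update(bigrams(posting))

print(counts.most_common(10))  # pairs like ("front", "end") hint at multi-word entities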
spaCy's token.lemma_ can help with lemmatization (spaCy ships a lemmatizer rather than a stemmer), collapsing inflected forms such as plurals.
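A quick check, assuming the small English model is installed:

import spacy

nlp = spacy.load("en_core_web_sm")  # assumes: python -m spacy download en_core_web_sm
doc = nlp("We build responsive frontends")
print([(t.text, t.lemma_) for t in doc])  # e.g. "frontends" -> "frontend"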
Learning that e.g. "React" and "Frontend" are more or less synonyms in this context would require a heavier-weight approach, such as word2vec, WordNet, or an LLM like ChatGPT.
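As a sketch of the word-vector route, here using spaCy's en_core_web_md model (an assumption; the md/lg models ship with word vectors, the sm model does not):

import spacy

nlp = spacy.load("en_core_web_md")  # md/lg models include word vectors
frontend, react = nlp("frontend"), nlp("react")
print(frontend.similarity(react))  # cosine similarity; a high score suggests the tags are related

You would then pick a similarity threshold above which two tags get merged into one database entry.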